Abstract: Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be a random couple with unknown
distribution $P$. Let $\mathcal{G}$ be a class of measurable functions and $\ell$ a loss function.
The problem of statistical learning deals with the estimation of the Bayes:

$$ g^* = \arg\min_{g \in \mathcal{G}} E_P\, \ell(g(X), Y). $$

In this paper, we study this problem when we deal with a contaminated
sample $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ of i.i.d. indirect observations. Each input $Z_i$,
$i = 1, \ldots, n$, is distributed from a density $Af$, where $A$ is a known compact
linear operator and $f$ is the density of the direct input $X$.
We derive fast rates of convergence for empirical risk minimizers based on
regularization methods, such as deconvolution kernel density estimators or
spectral cut-off. These results are comparable to the existing fast rates in
Koltchinskii [2006] for the direct case. They give some insight into the effect
of indirect measurements in the presence of fast rates of convergence.

1. Introduction 

In many real-life situations, direct data are not available and measurement
errors occur. In many fields, such as medicine, astronomy, econometrics or
meteorology, these measurement errors should not be neglected. Let us consider
the following example from signal processing in oncology. Medical images (such
as scanner or magnetic resonance imaging) play an increasingly important role
in diagnosing and treating cancer patients. In the clinical setting, imaging data
make it possible to better evaluate whether a cancer patient is responding to
therapy and to adjust the therapy accordingly. In such a setting, the response
variable could be the total response to the treatment, a partial response or the
absence of a response. However, image interpretation and management in clinical
trials trigger a number of issues such as doubtful reliability of image analysis
due to a high variability in image interpretation, censoring bias, and a number
of operational issues due to complex image data workflow. Consequently, biomarkers,
such as bidimensional measurements of lesions, suffer from measurement errors.
For these reasons, statistical learning with indirect observations may play a
crucial role for this problem.

In this contribution, we address this problem in the general statistical learning 
context. The model can be described through 4 components: 

• a generator G of random variables $X \in \mathcal{X} \subseteq \mathbb{R}^d$ with unknown density $f$
with respect to $\nu$, a $\sigma$-finite measure defined on $\mathcal{X}$,

• a supervisor S who associates to $X$ an output $Y \in \mathcal{Y}$, according to an
unknown conditional probability,

• a known linear compact operator $A : L_2(\nu, \mathcal{X}) \to L_2(\nu, \mathcal{X})$ which corrupts
$X$ into $Z$, where $Z$ has density $Af$ with respect to $\nu$,

• a Learning Machine LM which, given $n$ i.i.d. observations $(Z_i, Y_i)$, returns
an estimator $\hat{y}$ associated to any given $x$ from the generator.

Figure 1. This representation has its origin in Vapnik [2000]. Here, the
presence of the nuisance operator $A$ makes the matter an inverse problem.



The aim is to design a decision rule which returns, for each new generator's
value $x$, a value $\hat{y}$ as close as possible to the supervisor's response $y$. Note
that depending on the nature of the supervisor, Figure 1 contains models of
classification, density estimation or regression.

The most extensively studied model with indirect observations is the additive
measurement error model. In this case, we observe indirect inputs:

$$ Z_i = X_i + \epsilon_i, \quad i = 1, \ldots, n, $$

where the $\epsilon_i$ are i.i.d. with known density $\eta$. It corresponds in Figure 1 to a
convolution operator $A_\eta : f \mapsto f * \eta$ and we are faced with classification with
errors in variables, density deconvolution, or regression with errors in variables.

For these purposes, we introduce a bounded loss function $\ell : \mathbb{R} \times \mathcal{Y} \to [0, 1]$
and a class $\mathcal{G}$ of measurable functions $g : \mathcal{X} \to \mathbb{R}$. To define the best
approximation, the problem is to choose from the given set of functions $g \in \mathcal{G}$ the one
that minimizes the risk functional:

$$ R_\ell(g) = E_P\, \ell(g(X), Y). \qquad (1.1) $$

The performance of a given $g$ is measured through its non-negative excess
risk, given by:

$$ R_\ell(g) - R_\ell(g^*), \qquad (1.2) $$

where $g^*$ is the minimizer over $\mathcal{G}$ of the risk (1.1). It is important to point out
that we do not address in this paper the problem of model selection of $\mathcal{G}$, which
consists in studying the difference $R_\ell(g^*) - \inf_g R_\ell(g)$, where the infimum is
taken over all possible measurable functions $g$. Here, the target $g^*$ corresponds
to the oracle in the family $\mathcal{G}$. The purpose of this work is to use Empirical Risk
Minimization (ERM) strategies based on a corrupted sample to minimize the
excess risk (1.2).

In the direct case, as we observe i.i.d. $(X_1, Y_1), \ldots, (X_n, Y_n)$ with law $P$, a
classical way is to consider the ERM estimator defined as:

$$ \hat{g}_n = \arg\min_{g \in \mathcal{G}} R_n(g), \qquad (1.3) $$

where $R_n(g)$ denotes the empirical risk:

$$ R_n(g) = \frac{1}{n} \sum_{i=1}^n \ell(g(X_i), Y_i) = P_n \ell(g). $$
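The direct-case ERM (1.3) can be illustrated by a minimal sketch. The finite class of threshold classifiers, the toy sample and all function names below are hypothetical choices made for illustration only; they are not taken from the paper.

```python
import numpy as np

def empirical_risk(g, X, Y, loss):
    """R_n(g) = (1/n) * sum_i loss(g(X_i), Y_i)."""
    return float(np.mean(loss(g(X), Y)))

def erm(candidates, X, Y, loss):
    """Return the candidate with the smallest empirical risk (ties: first one)."""
    risks = [empirical_risk(g, X, Y, loss) for g in candidates]
    best = int(np.argmin(risks))
    return candidates[best], risks[best]

def hard_loss(pred, y):
    """0-1 loss, bounded in [0, 1] as required for the loss function."""
    return (pred != y).astype(float)

def threshold_classifier(t):
    """Hypothetical candidate g_t(x) = 1{x >= t}."""
    return lambda x: (x >= t).astype(int)

# Toy direct sample (assumption): X uniform on [0, 1], Y = 1{X >= 0.5}.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=200)
Y = (X >= 0.5).astype(int)

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
candidates = [threshold_classifier(t) for t in thresholds]
g_hat, risk_hat = erm(candidates, X, Y, hard_loss)
```

In this toy setting the class contains the rule that generated the labels, so the minimal empirical risk is zero.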

In the sequel, the empirical measure of the direct sample $(X_1, Y_1), \ldots, (X_n, Y_n)$
will be denoted as $P_n$. A large literature (see Vapnik [2000] for such a generality)
deals with the statistical performance of (1.3) in terms of the excess risk
(1.2). To be concise, under complexity assumptions over $\mathcal{G}$ (such as finite VC
dimension (Vapnik [1982]), entropy conditions (van de Geer [2000]), Rademacher
complexity assumptions (Koltchinskii [2006])), it is possible to get both
consistency and rates of convergence of ERM estimators (see also Massart and Nédélec
[2006] in classification). The main probabilistic tool is the statement of uniform
concentration of the empirical measure to the true measure. It comes from the
so-called Vapnik's bound:

$$ \begin{aligned}
R_\ell(\hat{g}_n) - R_\ell(g^*) &\le R_\ell(\hat{g}_n) - R_n(\hat{g}_n) + R_n(g^*) - R_\ell(g^*) \\
&\le 2 \sup_{g \in \mathcal{G}} |(P_n - P)\ell(g)|. \qquad (1.4)
\end{aligned} $$

It is important to highlight that (1.4) can be improved using a local approach
(see Massart [2000]). It consists in reducing the supremum to a neighborhood of
$g^*$. For the sake of concision, we do not develop these important refinements in
this introduction, although they are the main ingredient of the literature cited
above and make it possible to get fast rates of convergence in pattern recognition.

Here, the framework is essentially different. Given a linear compact operator
$A$, we observe a corrupted sample $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ where the $Z_i$,
$i = 1, \ldots, n$, are i.i.d. with density $Af$. As a result, the empirical measure
$P_n = \frac{1}{n} \sum_{i=1}^n \delta_{(X_i, Y_i)}$ is unobservable and standard ERM (1.3) is not available.
Unfortunately, using the contaminated sample $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ in standard
ERM (1.3) fails:

$$ \frac{1}{n} \sum_{i=1}^n \ell(g(Z_i), Y_i) \longrightarrow E\, \ell(g(Z), Y) \neq R_\ell(g). $$

Due to the action of $A$, the empirical measure from the indirect sample, denoted
by $\tilde{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{(Z_i, Y_i)}$, differs from $P_n$ (in the sequel, we also denote by $\tilde{P}$
the corresponding true measure of $(Z, Y)$). We are facing an ill-posed inverse
problem. This problem has been recently considered in Loustau and Marteau
[2011] for discriminant analysis with errors in variables.

S. Loustau / statistical learning with indirect observations
imsart-ps ver. 2008/08/29 file: noisystatlearii.tex date: July 11, 2012
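The failure of standard ERM on contaminated inputs can be checked on a small simulation. This is only a sketch under assumptions that are ours, not the paper's: a one-dimensional uniform input, noiseless labels, Gaussian measurement error with standard deviation 0.2, and a fixed candidate classifier.

```python
import numpy as np

def hard_loss(pred, y):
    """0-1 loss with values in {0, 1}."""
    return (pred != y).astype(float)

def g(x):
    """Fixed candidate classifier g(x) = 1{x >= 0.5} (hypothetical)."""
    return (x >= 0.5).astype(int)

def simulate(n, sigma, seed=0):
    """Draw a direct sample (X, Y) and contaminated inputs Z = X + eps."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=n)
    Y = (X >= 0.5).astype(int)              # supervisor: noiseless labels
    Z = X + rng.normal(0.0, sigma, size=n)  # additive measurement error
    return X, Y, Z

X, Y, Z = simulate(n=20000, sigma=0.2)
direct_risk = float(np.mean(hard_loss(g(X), Y)))        # estimates R_l(g)
contaminated_risk = float(np.mean(hard_loss(g(Z), Y)))  # estimates E l(g(Z), Y)
```

Here `g` has zero risk on the direct sample, while the naive empirical risk computed from the `Z` values stays bounded away from zero however large `n` is, which is the bias that the regularized empirical risk is meant to remove.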

In this work, we suggest a comparable strategy in statistical learning. Given
a smoothing parameter $\lambda = (\lambda_1, \ldots, \lambda_d) \in \mathbb{R}^d_+$, we consider the following
$\lambda$-Empirical Risk Minimization ($\lambda$-ERM):

$$ \arg\min_{g \in \mathcal{G}} R_n^\lambda(g), \qquad (1.5) $$

where $R_n^\lambda(g)$ is defined in a general way as:

$$ R_n^\lambda(g) = \int_{\mathcal{X} \times \mathcal{Y}} \ell(g(x), y)\, \hat{P}_\lambda(dx, dy). \qquad (1.6) $$

The measure $\hat{P}_\lambda = \hat{P}_\lambda(Z_1, Y_1, \ldots, Z_n, Y_n)$ depends on the data through the set of
indirect inputs $(Z_1, \ldots, Z_n)$. It will be related to standard regularization methods
coming from the inverse problem literature (see Engl et al. [1996]). As a
consequence, it depends on a smoothing parameter $\lambda \in \mathbb{R}^d_+$. An explicit construction
of $\hat{P}_\lambda$ and the empirical risk (1.6) is detailed in Section 2 in pattern recognition,
with applications in Section 3.

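To study the performance of the minimizer $\hat{g}_n^\lambda$ of the empirical risk (1.6), it
is possible to use empirical processes theory in the spirit of van de Geer [2000],
van der Vaart and Wellner [1996] or more recently Koltchinskii [2006]. Following
(1.4), in the presence of indirect observations, we can write¹:

$$ \begin{aligned}
R_\ell(\hat{g}_n^\lambda) - R_\ell(g^*) &\le R_\ell(\hat{g}_n^\lambda) - R_n^\lambda(\hat{g}_n^\lambda) + R_n^\lambda(g^*) - R_\ell(g^*) \\
&\le R_\ell^\lambda(\hat{g}_n^\lambda) - R_n^\lambda(\hat{g}_n^\lambda) + R_n^\lambda(g^*) - R_\ell^\lambda(g^*) + (R_\ell - R_\ell^\lambda)(\hat{g}_n^\lambda - g^*) \\
&\le \sup_{g \in \mathcal{G}} |(R_n^\lambda - R_\ell^\lambda)(g^* - g)| + \sup_{g \in \mathcal{G}} |(R_\ell^\lambda - R_\ell)(g - g^*)|, \qquad (1.7)
\end{aligned} $$

where in the sequel, under integrability conditions and using Fubini:

$$ R_\ell^\lambda(g) = E R_n^\lambda(g) = \int \ell(g(x), y)\, E\hat{P}_\lambda(dx, dy). \qquad (1.8) $$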

¹ Here, with a slight abuse of notation, we write:
$(R_\ell - R_\ell^\lambda)(g - g^*) = R_\ell(g) - R_\ell(g^*) - R_\ell^\lambda(g) + R_\ell^\lambda(g^*)$.

Bound (1.7) is called the Inverse Vapnik's bound. It consists of two terms:

• A variance term $\sup_{g \in \mathcal{G}} |(R_n^\lambda - R_\ell^\lambda)(g^* - g)|$ related to the estimation of
$g^*$: this term can be controlled thanks to uniform exponential inequalities
such as Talagrand's concentration inequality, applied to a class of functions
depending on a parameter.

• A bias term $\sup_{g \in \mathcal{G}} |(R_\ell^\lambda - R_\ell)(g - g^*)|$: it comes from the estimation of
$P$ in the expression of $R_\ell(g)$ by the estimator $\hat{P}_\lambda$. This term is specific
to our method. However, it seems to be related to the usual bias term in
nonparametric density estimation. Indeed, we can see easily that:



The choice of $\lambda$ is crucial in the decomposition (1.7). We will show below that
the variance term explodes when $\lambda$ tends to zero whereas the bias term vanishes.
Parameter $\lambda$ has to be chosen as a trade-off between these two terms, and as
a consequence will depend on unknown parameters. The problem of adaptation
is not addressed in this paper but it is an interesting future direction.

In this work, we consider $\mathcal{Y} = \{0, 1, \ldots, M\}$ for $M \ge 1$. In other words, we
study the model of classification with indirect observations (see Devroye et al.
[1996] for a survey in the direct case). The contribution is organized as follows. In
Section 2, we give an explicit construction of the empirical risk (1.6)
in classification thanks to the set of indirect observations. We state a general
upper bound for the solution of the $\lambda$-ERM (1.5) under minimal assumptions
over the loss function $\ell$ and the complexity of $\mathcal{G}$. It gives a generalization of the
results of Koltchinskii [2006] when dealing with indirect observations. Section
3 gives applications of the result of Section 2 in two particular settings. In
the errors-in-variables case, we generalize the results of Loustau and Marteau
[2011]. For the general case, we use projection in the spectrum of operator $A$. We
state rates of convergence which generalize the existing fast rates of convergence
pointed out by Koltchinskii [2006]. They coincide with a recent lower bound
proposed in discriminant analysis by Loustau and Marteau [2011]. Section 4 is
devoted to a discussion related to the complexity assumption when we deal
with indirect observations whereas Section 5 concludes the paper. Section 6 is
dedicated to the proofs of the main results.

2. General Upper Bound 

In this section, we detail the construction of the empirical risk (1.6) in
classification. We give minimal assumptions to control the expected excess risk (1.2) of
the procedure. The construction of the empirical risk is based on the following
decomposition of the true risk:

$$ R_\ell(g) = \sum_{y \in \mathcal{Y}} \int_{\mathcal{X}} \ell(g(x), y) f_y(x)\, \nu(dx)\, p(y), \qquad (2.1) $$

where $f_y(\cdot)$ is the conditional density of $X | Y = y$ and $p(y) = P(Y = y)$, for any
$y \in \mathcal{Y} = \{0, \ldots, M\}$. With such a decomposition, we suggest estimating each
$f_y(\cdot)$ using a nonparametric density estimator. To state a general upper bound,
we consider a family of estimators such as:

$$ \forall y \in \mathcal{Y}, \quad \hat{f}_y(x) = \frac{1}{n_y} \sum_{i=1}^{n_y} k_\lambda(Z_i^y, x), \qquad (2.2) $$

where $n_y = \mathrm{card}\{i : Y_i = y\}$, $k_\lambda : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and the set of inputs
$(Z_i^y)_{i=1}^{n_y} = \{Z_i, i = 1, \ldots, n : Y_i = y\}$.

Here, we consider a constant bandwidth $\lambda$ for any $y \in \mathcal{Y}$ in $\hat{f}_y$. It illustrates
rather well the difference between our approach and plug-in type estimators (see
Audibert and Tsybakov [2007] for instance). If we want to estimate $f_y$, for each
$y \in \mathcal{Y}$, the bandwidth $\lambda$ in (2.2) has to depend on $n_y$ and the regularity of $f_y$.
However, the aim here is to estimate the true risk $R_\ell(g)$. To get satisfying upper
bounds, we will see that $\lambda$ does not necessarily depend on the value $y \in \mathcal{Y}$.

It is also important to remark that assumption (2.2) provides a variety of
nonparametric estimators of $f_y$. For instance, if $Af = f * \eta$ is a convolution
operator, we can construct a deconvolution kernel provided that the noise has a
nonvanishing Fourier transform. This is a rather classical approach in deconvolution
problems (see Fan [1991] or Meister [2009]). Another standard example of (2.2)
is to consider projection estimators of the conditional densities using the SVD
of operator $A$ or many other regularization methods (see Engl et al. [1996]).
Section 3 describes these examples.

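Finally we plug estimators (2.2) in the true risk (2.1) to get an empirical risk
defined as:

$$ R_n^\lambda(g) = \sum_{y \in \mathcal{Y}} \int_{\mathcal{X}} \ell(g(x), y) \hat{f}_y(x)\, \nu(dx)\, \hat{p}(y), $$

where $\hat{p}(y) = \frac{n_y}{n}$ is an estimator of the quantity $p(y) = P(Y = y)$. Thanks to
(2.2), this empirical risk can be written as:

$$ R_n^\lambda(g) = \frac{1}{n} \sum_{i=1}^n \ell_\lambda(g, (Z_i, Y_i)), \qquad (2.3) $$

where $\ell_\lambda(g, (z, y))$ is a modified version of $\ell(g(x), y)$ given by:

$$ \ell_\lambda(g, (z, y)) = \int_{\mathcal{X}} \ell(g(x), y) k_\lambda(z, x)\, \nu(dx). $$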
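The equivalence between the plug-in form of the empirical risk and the averaged modified loss (2.3) can be verified numerically. The sketch below is illustrative only: the Gaussian kernel stands in for a generic $k_\lambda$ (it is not a deconvolution kernel), integrals are replaced by Riemann sums on a grid, and the sample, the classifier and all names are hypothetical.

```python
import numpy as np

def k_lam(z, x, lam):
    """Stand-in kernel k_lambda(z, x): Gaussian with bandwidth lam (assumption)."""
    u = (z - x) / lam
    return np.exp(-0.5 * u**2) / (lam * np.sqrt(2.0 * np.pi))

def hard_loss(pred, y):
    """0-1 loss with values in {0, 1}."""
    return (pred != y).astype(float)

def modified_loss(g, z, y, lam, grid):
    """l_lambda(g, (z, y)) = int l(g(x), y) k_lambda(z, x) dx (Riemann sum)."""
    dx = grid[1] - grid[0]
    return float(np.sum(hard_loss(g(grid), y) * k_lam(z, grid, lam)) * dx)

def risk_modified(g, Z, Y, lam, grid):
    """Form (2.3): average of the modified loss over the indirect sample."""
    return float(np.mean([modified_loss(g, z, y, lam, grid) for z, y in zip(Z, Y)]))

def risk_plugin(g, Z, Y, lam, grid):
    """Plug-in form: sum_y p_hat(y) * int l(g(x), y) f_hat_y(x) dx."""
    dx = grid[1] - grid[0]
    total = 0.0
    for y in np.unique(Y):
        Zy = Z[Y == y]
        f_hat = np.mean(k_lam(Zy[:, None], grid[None, :], lam), axis=0)
        p_hat = len(Zy) / len(Z)
        total += p_hat * np.sum(hard_loss(g(grid), y) * f_hat) * dx
    return float(total)

# Toy indirect sample (assumption): Laplace measurement error on the inputs.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=300)
Y = (X >= 0.5).astype(int)
Z = X + rng.laplace(0.0, 0.1, size=300)
g = lambda x: (x >= 0.5).astype(int)
grid = np.linspace(-2.0, 3.0, 2001)
r_mod = risk_modified(g, Z, Y, 0.2, grid)
r_plug = risk_plugin(g, Z, Y, 0.2, grid)
```

Both functions compute the same quantity up to floating-point rounding, since the factor $n_y/n$ cancels the normalization $1/n_y$ of each density estimator.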



In this section, we study general upper bounds for the expected excess risk
of the estimator:

$$ \hat{g}_n^\lambda = \arg\min_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \ell_\lambda(g, (Z_i, Y_i)). \qquad (2.4) $$

In case no such minimum exists, we can consider a $\delta$-approximate minimizer as
in Bartlett and Mendelson [2006] without significant change in the results.

The main idea is to use iteratively a deviation inequality for suprema of empirical
processes due to Bousquet [2002]. It makes it possible to control the increments
of the empirical process:

$$ \nu_n^\lambda(g) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \ell_\lambda(g, (Z_i, Y_i)) - E\, \ell_\lambda(g, (Z, Y)) \big). $$

Here, it is important to note that Talagrand's type inequality has to be applied
to the class of functions $\{(z, y) \mapsto \ell_\lambda(g, (z, y)), g \in \mathcal{G}\}$. This class depends on
a regularization parameter $\lambda$. This parameter will be calibrated as a function
of $n$, which is why the deviation inequality has to be used carefully. For this
purpose, we introduce in Definition 1 particular classes $\{\ell_\lambda(g), g \in \mathcal{G}\}$.

Definition 1. We say that the class $\{\ell_\lambda(g), g \in \mathcal{G}\}$ is a LB-class (Lipschitz
bounded class) with respect to $\mu$ with parameters $(c(\lambda), K(\lambda))$ if these two
properties hold:

($L_\mu$) $\{\ell_\lambda(g), g \in \mathcal{G}\}$ is Lipschitz w.r.t. $\mu$ with constant $c(\lambda)$:

$$ \forall g, g' \in \mathcal{G}, \quad \|\ell_\lambda(g) - \ell_\lambda(g')\|_{L_2(\tilde{P})} \le c(\lambda) \|\ell(g) - \ell(g')\|_{L_2(\mu)}. $$

(B) $\{\ell_\lambda(g), g \in \mathcal{G}\}$ is uniformly bounded with constant $K(\lambda)$:

$$ \sup_{g \in \mathcal{G}} \sup_{(z, y)} |\ell_\lambda(g, (z, y))| \le K(\lambda). $$

An LB-class of loss functions is Lipschitz and bounded with constants which
depend on $\lambda$. These properties are necessary to derive explicitly the upper bound
of the variance in (1.7) as a function of $\lambda$.

More precisely, the Lipschitz property ($L_\mu$) is a key ingredient to control the
complexity of the class of functions $\{\ell_\lambda(g), g \in \mathcal{G}\}$. In the sequel, we use the
following geometric complexity parameter:

$$ \tilde{\omega}_n(\mathcal{G}, \delta, \mu) = E \sup_{g, g' \in \mathcal{G} : \|\ell(g) - \ell(g')\|_{L_2(\mu)} \le \delta} |(\tilde{P} - \tilde{P}_n)(\ell_\lambda(g) - \ell_\lambda(g'))|. $$

The control of such a quantity is proposed in Section 4 thanks to standard
entropy conditions related to the class $\mathcal{G}$.

Finally (B) is necessary to apply Bousquet's inequality to the class of
functions $\{\ell_\lambda(g) - \ell_\lambda(g'), g, g' \in \mathcal{G}\}$, which depends on the smoothing parameter $\lambda$. This
condition could be relaxed thanks to recent advances on empirical processes in
an unbounded framework (see Lecué and Mendelson [2012] or Lederer and van de Geer
[2012]).

Definition 2. For $\kappa \ge 1$, we say that $\mathcal{F}$ is a Bernstein class with respect to $\mu$
with parameter $\kappa$ if there exists $\kappa_0 \ge 0$ such that for every $f \in \mathcal{F}$:

$$ \|f\|^2_{L_2(\mu)} \le \kappa_0 [E_P f]^{1/\kappa}. $$

This assumption first appears in Bartlett and Mendelson [2006] for $\mu = P$
when $\mathcal{F} = \{\ell(g) - \ell(g^*), g \in \mathcal{G}\}$ is the excess loss class. It makes it possible to
control the excess risk in statistical learning using functional Bernstein
inequalities such as Talagrand's type inequality. It goes back to the standard
margin assumption in classification (see Mammen and Tsybakov [1999], Tsybakov
[2004b]), where in this case $\kappa = \frac{\alpha + 1}{\alpha}$ for a so-called margin parameter $\alpha > 0$.

Definition 2 has to be combined with the Lipschitz property of Definition 1.
It allows us to have the following series of inequalities:

$$ \|\ell_\lambda(g) - \ell_\lambda(g^*)\|_{L_2(\tilde{P})} \le c(\lambda) \|f\|_{L_2(\mu)} \le c(\lambda) \sqrt{\kappa_0}\, (E_P f)^{\frac{1}{2\kappa}}, \qquad (2.5) $$

where $f \in \mathcal{F} = \{\ell(g) - \ell(g^*), g \in \mathcal{G}\}$ is the excess loss class.

The last definition provides a control of the bias term in (1.7) as follows:

Definition 3. The class $\{\ell_\lambda(g), g \in \mathcal{G}\}$ has approximation function $a(\lambda)$ and
residual constant $0 < r < 1$ if the following holds:

$$ \forall g \in \mathcal{G}, \quad (R_\ell - R_\ell^\lambda)(g - g^*) \le a(\lambda) + r (R_\ell(g) - R_\ell(g^*)), $$

where with a slight abuse of notation, we write:

$$ (R_\ell - R_\ell^\lambda)(g - g^*) = R_\ell(g) - R_\ell(g^*) - R_\ell^\lambda(g) + R_\ell^\lambda(g^*). $$

This definition warrants a control of the bias in the Inverse Vapnik's bound
(1.7). It is straightforward that with Definition 3, we get a control of the excess
risk as follows:

$$ R_\ell(\hat{g}_n^\lambda) - R_\ell(g^*) \le \frac{1}{1 - r} \Big[ \sup_{g \in \mathcal{G}(1)} |(\tilde{P}_n - \tilde{P})(\ell_\lambda(g) - \ell_\lambda(g^*))| + a(\lambda) \Big], $$

where in the sequel:

$$ \mathcal{G}(\delta) = \{ g \in \mathcal{G} : R_\ell(g) - R_\ell(g^*) \le \delta \}. $$

Explicit functions $a(\lambda)$ and residual constants $r < 1$ are obtained in Section 3.
They depend on the regularity conditions and make it possible to get rates of
convergence. We are now in a position to state the main result of this section.

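Theorem 1. Consider a LB-class $\{\ell_\lambda(g), g \in \mathcal{G}\}$ with respect to $\mu$ with
parameters $(c(\lambda), K(\lambda))$ and approximation function $a(\lambda)$ such that:

$$ a(\lambda) \le C_1 \Big( \frac{c(\lambda)}{\sqrt{n}} \Big)^{\frac{2\kappa}{2\kappa + \rho - 1}} \quad \text{and} \quad K(\lambda) \le \Big( \frac{c(\lambda)}{\sqrt{n}} \Big)^{\frac{2\kappa}{2\kappa + \rho - 1}} \frac{n}{1 + \log n}, \qquad (2.6) $$

for some $C_1 > 0$.

Suppose $\{\ell(g) - \ell(g^*), g \in \mathcal{G}\}$ is Bernstein with respect to $\mu$ with parameter
$\kappa \ge 1$, where $g^* \in \arg\min_{\mathcal{G}} R_\ell(g)$ is unique. Suppose there exists $0 < \rho < 1$ such
that for every $\delta > 0$:

$$ \tilde{\omega}_n(\mathcal{G}, \delta, \mu) = E \sup_{g, g' \in \mathcal{G} : \|\ell(g) - \ell(g')\|_{L_2(\mu)} \le \delta} |(\tilde{P} - \tilde{P}_n)(\ell_\lambda(g) - \ell_\lambda(g'))| \le C_2 \frac{c(\lambda)}{\sqrt{n}} \delta^{1 - \rho}, \qquad (2.7) $$

for some $C_2 > 0$.

Then the estimator $\hat{g}_n^\lambda$ defined in (2.4) satisfies, for $n$ large enough:

$$ E R_\ell(\hat{g}_n^\lambda) - R_\ell(g^*) \le C \Big( \frac{c(\lambda)}{\sqrt{n}} \Big)^{\frac{2\kappa}{2\kappa + \rho - 1}}, $$

where $C = C(C_1, C_2, \kappa, \kappa_0, \rho) > 0$.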

The proof of this result is presented in Section 6. Here follow some remarks.

This upper bound generalizes the result presented in Koltchinskii [2006] to the
indirect framework. Theorem 1 provides rates of convergence $(c(\lambda)/\sqrt{n})^{2\kappa/(2\kappa+\rho-1)}$.
In the noise-free case, with standard ERM estimators, Koltchinskii [2006] and Tsybakov
[2004b] obtain fast rates $n^{-\kappa/(2\kappa+\rho-1)}$. In the presence of contaminated inputs,
rates are slower since $c(\lambda) \to +\infty$ as $n \to +\infty$. Hence, the price to pay for the
inverse problem is quantified by the Lipschitz constant $c(\lambda)$ in Definition 1.

The behavior of the constants $c(\lambda)$ depends on the difficulty of the inverse problem
through the degree of ill-posedness of operator $A$. Section 3 deals with
mildly ill-posed inverse problems. In this case, $c(\lambda)$ depends polynomially on $\lambda$.

The Lipschitz property introduced in Definition 1 is central. Combined with
the complexity assumption (2.7), it leads to a control of the variance term in
decomposition (1.7). The first statement of condition (2.6) gives the order of
the bias term. It leads to the excess risk bound.

The second part of (2.6) is due to the use of a deviation inequality from
Bousquet [2002] applied to the class $\{\ell_\lambda(g), g \in \mathcal{G}\}$. In Section 3, we give explicit
constants $c(\lambda)$ and $K(\lambda)$. It appears that this assumption is always guaranteed.

The control of the modulus of continuity in (2.7) is specific to the indirect
framework. It depends on the Lipschitz constant $c(\lambda)$. A comparable hypothesis
can be found in the direct case in Koltchinskii [2006], except for the constant
$c(\lambda)$. Section 4 is dedicated to the statement of (2.7). Under standard complexity
conditions, such as $L_2(\mu)$-entropy of the loss class $\{\ell(g), g \in \mathcal{G}\}$, (2.7) holds true
(see Lemma 1 in Section 4 and the related discussion). It allows us to consider
many examples of hypothesis spaces, from finite VC classes to more complex
functional classes such as kernel classes.

At this point, it is important to note that Theorem 1 depends on the measure $\mu$
introduced in Definitions 1 and 2. In the rest of the paper, we will consider two
particular cases: $\mu = \nu \otimes P_Y$ ($\mu = \nu_Y$ for short in the sequel) and $\mu = P$. The
Lipschitz property ($L_\mu$) with $\mu = P$ is stronger than ($L_\mu$) with $\mu = \nu \otimes P_Y$.
Indeed, for any measurable function $h : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, if $\|f_y\|_\infty \le C_y$, $\forall y \in \mathcal{Y}$:

$$ \|h\|^2_{L_2(P)} = \sum_{y \in \mathcal{Y}} p(y) \int_{\mathcal{X}} h(x, y)^2 f_y(x)\, \nu(dx) \le \max_{y \in \mathcal{Y}} C_y\, \|h\|^2_{L_2(\nu_Y)}. $$

Since $\|\cdot\|_{L_2(P)} \le C \|\cdot\|_{L_2(\nu_Y)}$ for some $C > 0$, a Bernstein class with
respect to $\nu_Y$ is also Bernstein with respect to $P$ (see Definition 2). The most
favorable case ($\mu = \nu_Y$) arises in binary classification (see Tsybakov [2004b] or
Massart and Nédélec [2006]). Section 3 states rates of convergence in these two
different settings.

Finally, Theorem 1 requires the uniqueness of the Bayes $g^*$. Such a restriction
can be avoided using a more sophisticated geometry as in [Koltchinskii, 2006,
Section 4].

3. Applications 

In this section, we apply the general upper bound of Theorem 1 to
give rates of convergence of $\lambda$-ERM in two distinct frameworks. The first result
deals with the errors-in-variables case where operator $A$ is a convolution product.
Using kernel deconvolution estimators, we obtain fast rates of convergence.
Then, we consider the general case using a family of projection estimators into
the SVD basis of the operator. We also consider two different settings in the
sequel, namely $\mu = \nu_Y$ and $\mu = P$ (see the discussion at the end of Section 2).
In the latter case, we restrict the study to a compact set $K \subseteq \mathcal{X}$.

3.1. Errors-in-variables case 

The elementary model of indirect observations is the additive measurement error
model with known error density. In this case, we suppose that we observe a
corrupted training set $(Z_i, Y_i)$, $i = 1, \ldots, n$, where:

$$ Z_i = X_i + \epsilon_i, \quad i = 1, \ldots, n. $$

The random variables $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $\mathbb{R}^d$-valued random variables with
density $\eta$ with respect to the Lebesgue measure on $\mathbb{R}^d$. In this situation,
operator $A$ is exactly known as a convolution product with density $\eta$. Note that in
practical applications, this knowledge cannot be guaranteed. However, in most
examples, we are able to estimate the error density $\eta$ from replicated
measurements. In the sequel, we do not address this problem and we focus on the
deconvolution step itself.

In the errors-in-variables case, the difficulty of this inverse problem can be
represented thanks to the asymptotic behavior of the Fourier transform of the noise
density $\eta$. Assumption (A1) below concerns the asymptotic behavior of the
characteristic function of the noise distribution. This kind of restriction is
standard in deconvolution problems (see Butucea [2007], Fan [1991], Meister
[2009]).

(A1) There exist $(\beta_1, \ldots, \beta_d)' \in \mathbb{R}^d_+$ such that for all $i \in \{1, \ldots, d\}$, $\beta_i > \frac{1}{2}$
and:

$$ |\mathcal{F}[\eta_i](t)| \sim |t|^{-\beta_i}, \text{ as } t \to +\infty, $$

where $\mathcal{F}[\eta_i]$ denotes the Fourier transform of $\eta_i$. Moreover, we assume that
$\mathcal{F}[\eta_i](t) \neq 0$ for all $t \in \mathbb{R}$ and $i \in \{1, \ldots, d\}$.
Assumption (A1) focuses on moderately ill-posed inverse problems by
considering polynomial decay of the Fourier transform. Notice that straightforward
modifications in the proofs make it possible to consider severely ill-posed inverse
problems. In this framework, we construct kernel deconvolution estimators of the
densities $f_y$, $y \in \mathcal{Y}$. For this purpose, let us introduce $\mathcal{K} = \prod_{j=1}^d \mathcal{K}_j : \mathbb{R}^d \to \mathbb{R}$, a
$d$-dimensional function defined as the product of $d$ unidimensional functions $\mathcal{K}_j$.
Then if we denote by $\lambda = (\lambda_1, \ldots, \lambda_d) \in \mathbb{R}^d_+$ a set of (positive) bandwidths, we
define $\mathcal{K}_\eta$ as

$$ \mathcal{K}_\eta : \mathbb{R}^d \to \mathbb{R}, \quad t \mapsto \mathcal{K}_\eta(t) = \mathcal{F}^{-1}\left[ \frac{\mathcal{F}[\mathcal{K}](\cdot)}{\mathcal{F}[\eta](\cdot / \lambda)} \right](t). \qquad (3.1) $$
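For intuition, the deconvolution kernel (3.1) has a closed form in a simple one-dimensional case. The sketch below assumes, for illustration only, a standard Gaussian kernel $\mathcal{K}$ and Laplace noise with scale `b`, whose Fourier transform is $1/(1 + b^2 t^2)$; neither choice comes from the paper, and a Gaussian kernel is used purely for convenience. In that case $\mathcal{K}_\eta(u) = \mathcal{K}(u)\,[1 + (b/\lambda)^2 (1 - u^2)]$, which the code checks against a direct numerical Fourier inversion.

```python
import numpy as np

def gauss(u):
    """Standard Gaussian density, used here as the kernel K (assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def deconv_kernel(u, lam, b):
    """Closed form of (3.1) for a Gaussian K and Laplace(b) noise, d = 1."""
    return gauss(u) * (1.0 + (b / lam) ** 2 * (1.0 - u**2))

def deconv_kernel_numeric(u, lam, b, t_max=12.0, m=4001):
    """Same kernel by numerical inversion of F[K](t) / F[eta](t / lam)."""
    t = np.linspace(-t_max, t_max, m)
    dt = t[1] - t[0]
    # F[K](t) = exp(-t^2 / 2) and 1 / F[eta](t / lam) = 1 + (b t / lam)^2.
    ratio = np.exp(-0.5 * t**2) * (1.0 + (b * t / lam) ** 2)
    return float(np.sum(np.cos(t * u) * ratio) * dt / (2.0 * np.pi))

def deconv_density(x, Z, lam, b):
    """f_hat(x) = (1 / (n * lam)) * sum_i K_eta((Z_i - x) / lam)."""
    u = (Z[None, :] - np.atleast_1d(x)[:, None]) / lam
    return np.mean(deconv_kernel(u, lam, b), axis=1) / lam
```

The correction factor grows like $(b/\lambda)^2$ as the bandwidth shrinks, which is a concrete instance of the variance inflation caused by the inverse problem. With `b = 0` (no noise) the estimator reduces to an ordinary kernel density estimator.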



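To apply Theorem 1, we also need the following assumption on the regularity
of the conditional densities:

(R1) Given $\gamma, L > 0$, for any $y \in \mathcal{Y}$, $f_y \in \mathcal{H}(\gamma, L)$ where:

$$ \mathcal{H}(\gamma, L) = \{ f \in \Sigma(\gamma, L) : f \text{ are bounded probability densities w.r.t. Lebesgue} \}, $$

and $\Sigma(\gamma, L)$ is the class of isotropic Hölder continuous functions $f$ having
continuous partial derivatives up to order $\lfloor \gamma \rfloor$, the maximal integer strictly less than
$\gamma$, and such that:

$$ |f(y) - p_{f,x}(y)| \le L |x - y|^\gamma, $$

where $p_{f,x}$ is the Taylor polynomial of $f$ at order $\lfloor \gamma \rfloor$ at point $x$.

This Hölder regularity is standard to control the bias term of kernel estimators in
density estimation or density deconvolution (see for instance Tsybakov [2004a]).
In this context, for all $g \in \mathcal{G}$, we define the $\lambda$-ERM (2.4) with empirical risk:

$$ R_n^\lambda(g) = \frac{1}{n} \sum_{i=1}^n \ell_\lambda(g, (Z_i, Y_i)), \qquad (3.2) $$

where $\ell_\lambda(g, (z, y))$ is given by:

$$ \ell_\lambda(g, (z, y)) = \int \ell(g(x), y) \frac{1}{\lambda} \mathcal{K}_\eta\Big( \frac{z - x}{\lambda} \Big) dx, $$

where with a slight abuse of notation we write, for any $z = (z_1, \ldots, z_d)$,
$x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, $\lambda = (\lambda_1, \ldots, \lambda_d) \in \mathbb{R}^d_+$:

$$ \frac{1}{\lambda} \mathcal{K}_\eta\Big( \frac{z - x}{\lambda} \Big) = \prod_{i=1}^d \frac{1}{\lambda_i}\, \mathcal{K}_\eta\Big( \frac{z_1 - x_1}{\lambda_1}, \ldots, \frac{z_d - x_d}{\lambda_d} \Big). $$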

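Theorem 2 below presents the rates of convergence of $\lambda$-ERM under assumptions
(A1)-(R1).

Theorem 2. Suppose $\{\ell(g) - \ell(g^*), g \in \mathcal{G}\}$ is a Bernstein class with respect
to $\nu_Y$ with parameter $\kappa \ge 1$ and $\ell(g(\cdot), y) \in L_2(\mathbb{R}^d)$, for any $y \in \mathcal{Y}$. Suppose
$0 < \rho < 1$ exists such that:

$$ \tilde{\omega}_n(\mathcal{G}, \delta, \nu_Y) \le C_1 \frac{c(\lambda)}{\sqrt{n}} \delta^{1 - \rho}, \quad \forall\, 0 < \delta < 1, $$

for some $C_1 > 0$.

Under (A1) and (R1), we have, for $n$ large enough:

$$ \sup_{f_y \in \mathcal{H}(\gamma, L)} E R_\ell(\hat{g}_n^\lambda) - R_\ell(g^*) \le C n^{-\frac{\kappa \gamma}{\gamma(2\kappa + \rho - 1) + (2\kappa - 1)\bar{\beta}}}, $$

where $\bar{\beta} = \sum_{i=1}^d \beta_i$ and $\lambda = (\lambda_1, \ldots, \lambda_d)$ is given by:

$$ \forall i \in \{1, \ldots, d\}, \quad \lambda_i \sim n^{-\frac{2\kappa - 1}{2\gamma(2\kappa + \rho - 1) + 2(2\kappa - 1)\bar{\beta}}}. \qquad (3.3) $$

The proof of this result is postponed to Section 6. Here follow some remarks.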

Rates in Theorem 2 generalize the result of Koltchinskii [2006] (see also
Tsybakov [2004b]) to the errors-in-variables case. Note that if $\bar{\beta} = 0$, we
get the rates of the direct case. Here, the price to pay for the inverse problem
of deconvolution can be quantified by the additional term $(2\kappa - 1)\bar{\beta}$ in the
exponent, where $\kappa \ge 1$. Hence, the performance of the method depends on the
behavior of the characteristic function of the noise distribution. In pattern
recognition, it is important to notice that the influence of the errors in
variables is related to both parameters $\kappa$ and $\gamma$. The same
phenomenon also occurs in Loustau and Marteau [2011].
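The arithmetic linking Theorems 1 and 2 can be checked with exact fractions. The sketch below takes the exponents as we read them from Theorem 2 and (3.3), namely a rate $n^{-\kappa\gamma / (\gamma(2\kappa+\rho-1) + (2\kappa-1)\bar\beta)}$ and bandwidths $\lambda_i \sim n^{-(2\kappa-1) / (2\gamma(2\kappa+\rho-1) + 2(2\kappa-1)\bar\beta)}$, and it assumes a Lipschitz constant of the form $c(\lambda) = \prod_i \lambda_i^{-\beta_i}$. The function names are hypothetical.

```python
from fractions import Fraction

def rate_exponent(kappa, rho, gamma, beta_bar):
    """Exponent r such that the excess risk bound is of order n**(-r)."""
    return (kappa * gamma) / (gamma * (2 * kappa + rho - 1) + (2 * kappa - 1) * beta_bar)

def bandwidth_exponent(kappa, rho, gamma, beta_bar):
    """Exponent s such that lambda_i is of order n**(-s), as in (3.3)."""
    return (2 * kappa - 1) / (
        2 * gamma * (2 * kappa + rho - 1) + 2 * (2 * kappa - 1) * beta_bar
    )

def rate_via_theorem1(kappa, rho, gamma, beta_bar):
    """Exponent of (c(lambda) / sqrt(n))**(2 kappa / (2 kappa + rho - 1)).

    Assumes c(lambda) = prod_i lambda_i**(-beta_i), i.e. c(lambda) = n**(beta_bar * s).
    """
    s = bandwidth_exponent(kappa, rho, gamma, beta_bar)
    return (Fraction(1, 2) - beta_bar * s) * (2 * kappa) / (2 * kappa + rho - 1)

# Example values (arbitrary): kappa = 2, rho = 1/2, gamma = 2, beta_bar = 1.
params = (Fraction(2), Fraction(1, 2), Fraction(2), Fraction(1))
r = rate_exponent(*params)
s = bandwidth_exponent(*params)
```

Under these assumptions both routes give the same exponent, and setting `beta_bar` to zero recovers the direct-case exponent $\kappa / (2\kappa + \rho - 1)$.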

It is also interesting to study the minimax optimality of the result of Theorem 2
using the lower bounds presented in Loustau and Marteau [2011]. For this
purpose, let us introduce a random couple $(X, Y)$ with law $P$ on $\mathcal{X} \times \{0, 1\}$. Given
$\mathcal{G}$ and the class of associated candidates $\{g(x) = 1_G(x), G \in \mathcal{G}\}$, we consider
the hard loss $\ell_H(g(x), y) = |y - 1_G(x)|$. In this case, the Bayes risk is defined
as:

$$ R_H(G) = E |Y - 1_G(X)|. $$

It is easy to see that for $y \in \{0, 1\}$ and $g(x) = 1_G(x)$, we have:

$$ |\ell_H(g(x), y) - \ell_H(g'(x), y)| = \big| |y - 1_G(x)| - |y - 1_{G'}(x)| \big| = |1_G(x) - 1_{G'}(x)|. $$

Combined with the margin assumption, Lemma 2 in Mammen and Tsybakov
[1999] allows us to write:

$$ \|\ell_H(g) - \ell_H(g^*)\|^2_{L_2(\nu_Y)} = \|1_G - 1_{G^*}\|^2_{L_2(\mathbb{R}^d)} = d_\Delta(G, G^*) \le c\, (R_H(g) - R_H(g^*))^{\frac{\alpha}{\alpha + 1}}. $$

As a result, provided that $G^* \in \mathcal{G}$ and under the margin assumption, the excess
loss class $\{\ell_H(g) - \ell_H(g^*)\}$ is Bernstein with respect to $\mu = \nu_Y$ with parameter
$\kappa = \frac{\alpha + 1}{\alpha}$.

To apply Theorem 2, we need to check ($L_\mu$) and (B) from Definition 1. Note
that from Lemma 3 in Loustau and Marteau [2011], we have:

\\h{9)~lx{9')\\l,^P)<Cnt,\f'd^{G,G'), 

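where for any $g = 1_G$:

$$ \ell_\lambda(g, z, y) = \int \ell_H(g(x), y) \frac{1}{\lambda} \mathcal{K}_\eta\Big( \frac{z - x}{\lambda} \Big) dx. $$

Consequently, $\{\ell_\lambda(g), g = 1_G : G \in \mathcal{G}\}$ is a LB-class with respect to $\nu_Y$ with
constants $c(\lambda)$ and $K(\lambda)$ given by: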



C{X)=UU\7^' and/^(A)=ntiA: 



The last step is to control the complexity parameter $\tilde{\omega}_n(\mathcal{G}, \delta, \nu_Y)$ as a function of
$\delta$. With Lemma 5.1 in Audibert and Tsybakov [2007], a control of the
$L_2(\nu_Y)$-entropy with bracketing of the class $\{1_G, G \in \mathcal{G}\}$ is given by:

\ogN{{ Ig, G e Q},L2{vY),e) < ce-^, 

under a plug-in type regularity assumption such as (R1). As a result, we can
apply Lemma 1 in Section 4 to get a control of the desired modulus of continuity
as follows:

for some Ci > 0. 

Finally, using Lemma 4 in Section 6, in the particular case of the hard loss,
$\{\ell_\lambda(g), g \in \mathcal{G}\}$ has approximation function $a(\lambda)$ with residual constant $0 < r < 1$ given by:

d 



a(A) = ^ Aj" ^'^ and r = — . 



4=1 



In this case, Theorem 2 leads to:



Ei?ffa^)-i?ff(g*) < Cn 



7(Q + 2) + d+23 



This rate corresponds to the minimax rates of classification with errors in
variables stated in Loustau and Marteau [2011]. It ensures the minimax optimality
of the method in the errors-in-variables case for this particular loss. An open
problem is to give a lower bound for more general losses.

3.2. General case with singular values decomposition 

In this section, we observe a training set $(Z_i, Y_i)$, $i = 1, \ldots, n$, where the $Z_i$ are i.i.d.
with law $Af$, where $A : L_2(\mathcal{X}) \to L_2(\mathcal{X})$ is a known linear compact operator.
For simplicity, we also restrict ourselves to moderately ill-posed inverse problems,
considering the singular value decomposition of $A$. Since $A$ is compact, $A^*A$ is
self-adjoint and compact. We can find an orthonormal basis of eigenfunctions
of $A^*A$, denoted by $(\phi_k)_{k \in \mathbb{N}^*}$. We obtain $A^*A\phi_k = b_k^2 \phi_k$, with $(b_k)_{k \in \mathbb{N}^*}$ the
decreasing sequence of singular values. Considering the image basis $\psi_k = A\phi_k / b_k$,
we have the following SVD (singular value decomposition):

$$ A\phi_k = b_k \psi_k \quad \text{and} \quad A^*\psi_k = b_k \phi_k, \quad k \in \mathbb{N}^*. \qquad (3.4) $$

In the sequel, we make the following assumption:

(A2) There exists $\beta \in \mathbb{R}_+$ such that:

$$ b_k \sim k^{-\beta} \text{ as } k \to +\infty. $$

In this case, the rate of decrease of the singular values is polynomial. As an
example, we can consider the convolution operator above: from an easy
calculation, the spectral domain is the Fourier domain and (A2) is comparable
to (A1). However, assumption (A2) can deal with any linear inverse problem
and is rather standard in the statistical inverse problem literature (see Cavalier
[2008]).

In this framework, we also need the following assumption on the regularity
of the conditional densities in the basis of the operator $A$:

(R2) For any $y \in \mathcal{Y}$, $f_y \in \mathcal{P}(\gamma, L)$ where:

$$ \mathcal{P}(\gamma, L) = \{ f \in \Theta(\gamma, L) : f \text{ are bounded probability densities w.r.t. Lebesgue} \}, $$

and $\Theta(\gamma, L)$ is the ellipsoid in the SVD basis defined as:

$$ \Theta(\gamma, L) = \Big\{ f(x) = \sum_{k \ge 1} \theta_k \phi_k(x) : \sum_{k \ge 1} \theta_k^2 k^{2\gamma} \le L \Big\}. $$

Considering the SVD (3.4), we propose to replace in the true risk the conditional
densities $f_y$ by a family of projection estimators given by:

$$ \hat{f}_y(x) = \sum_{k=1}^N \hat{\theta}_k^y \phi_k(x), \qquad (3.5) $$

where $\hat{\theta}_k^y$ is an unbiased estimator of $\theta_k^y = \int f_y \phi_k\, d\nu$ given by:

$$ \hat{\theta}_k^y = \frac{1}{n_y} \sum_{i=1}^{n_y} b_k^{-1} \psi_k(Z_i^y). \qquad (3.6) $$

In this case, assumption (2.2) is satisfied with $k_N(z, x) = \sum_{k=1}^N b_k^{-1} \psi_k(z) \phi_k(x)$.
It gives the following expression of the empirical risk:

$$ R_n^N(g) = \frac{1}{n} \sum_{i=1}^n \ell_N(g, (Z_i, Y_i)), $$

where:

$$ \ell_N(g, (z, y)) = \sum_{k=1}^N b_k^{-1} \psi_k(z) \int_{\mathcal{X}} \ell(g(x), y) \phi_k(x)\, \nu(dx). $$

The next theorem states the rates of convergence for the ERM estimator defined
as:

$$ \hat{g}_n^N = \arg\min_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \ell_N(g, (Z_i, Y_i)). $$
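The SVD relations (3.4) and the spectral cut-off idea behind the projection estimators can be illustrated in finite dimension. In the sketch below a matrix plays the role of the compact operator $A$, with singular values $b_k = k^{-\beta}$ chosen to mimic (A2); the matrix construction, the test vector and all names are hypothetical, and no statistical noise is simulated.

```python
import numpy as np

def make_operator(m, beta, seed=0):
    """Build an m x m matrix with singular values b_k = k**(-beta)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.normal(size=(m, m)))
    V, _ = np.linalg.qr(rng.normal(size=(m, m)))
    b = np.arange(1, m + 1, dtype=float) ** (-beta)
    return U @ np.diag(b) @ V.T

def spectral_cutoff(A, h, N):
    """Keep the first N singular components: sum_k b_k^{-1} <h, psi_k> phi_k."""
    U, b, Vt = np.linalg.svd(A)
    theta_hat = (U[:, :N].T @ h) / b[:N]
    return Vt[:N].T @ theta_hat

A = make_operator(m=8, beta=1.0)
U, b, Vt = np.linalg.svd(A)   # columns of U: psi_k, rows of Vt: phi_k
f = np.linspace(1.0, 2.0, 8)  # stand-in for the unknown density coefficients
h = A @ f                     # what is indirectly observed
f_N3 = spectral_cutoff(A, h, 3)
f_full = spectral_cutoff(A, h, 8)
```

With all eight components the original vector is recovered, while a cut-off at `N = 3` returns its projection on the first three right singular vectors. The division by `b[:N]` is where small singular values amplify errors when the observation is noisy, which is why `N` has to be balanced against the sample size.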



Theorem 3. Suppose $\{\ell(g) - \ell(g^*), g \in \mathcal{G}\}$ is a Bernstein class with respect to
$\nu_Y$ with parameter $\kappa \ge 1$ such that $\ell(g(\cdot), y) \in L_2(\nu)$, for any $y \in \mathcal{Y}$. Suppose
$0 < \rho < 1$ exists such that:



Con{G,S,iyY) < Ci^S^-P, VO < S 



< 1, 



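for some $C_1 > 0$. Then under (A2) and (R2), $\hat{g}_n^N$ satisfies, for $n$ large enough:

$$ \sup_{f_y \in \mathcal{P}(\gamma, L)} E R_\ell(\hat{g}_n^N) - R_\ell(g^*) \le C n^{-\frac{\kappa \gamma}{\gamma(2\kappa + \rho - 1) + (2\kappa - 1)\beta}}, $$

where we choose $N$ such that:

$$ N \sim n^{\frac{2\kappa - 1}{2\gamma(2\kappa + \rho - 1) + 2(2\kappa - 1)\beta}}. $$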

Theorem 3 shows that in pattern recognition with indirect observations, we
can deal with any linear compact operator $A$ using the SVD. From this point of
view, this result could be compared with Klemelä and Mammen [2010] where
the white noise model is considered.

Rates of convergence in Theorem 3 are comparable with Theorem 2. If $A$ is
a convolution operator, the result above shows that $\hat{g}_n^N$ using projection
estimators in the SVD reaches the rate of Theorem 2 using kernel deconvolution
estimators. In this case, the regularity assumption deals with ellipsoids in the
SVD domain instead of Hölder classes. However, we can conjecture that this
result is also minimax, although a rigorous lower bound remains to be established.
Finally, this result might be extended to other linear regularization methods
without significant change. Here, we present the result for projections into the
SVD domain for the sake of simplicity in the proofs, but Tikhonov and Landweber
regularization could be considered for instance.



3.3. Restriction to a compact K 

In this subsection, we develop an alternative to Theorems 2 and 3 to deal with a
weaker Bernstein assumption. For the sake of simplicity, we restrict ourselves
in Theorems 2 and 3 to Bernstein classes with respect to the measure $\nu_Y = \nu \otimes P_Y$ (see
Definition 2). In this case, it is sufficient to deal with an LB-class with respect to
$\nu_Y$ in Definition 1, thanks to (2.5). However, Bernstein classes with respect to
$\nu_Y$ appear only in particular cases, such as classification with hard loss in the
context of Mammen and Tsybakov [1999], Tsybakov [2004b] (see Section 3.1).
Here, we present a corollary of Theorems 2 and 3. It allows us to deal with Bernstein
classes in the spirit of Bartlett and Mendelson [2006], namely such that:

$$E_P f^2 \le \kappa_0 (E_P f)^{1/\kappa}, \quad \forall f \in \mathcal{F} = \{\ell(g) - \ell(g^*), g \in \mathcal{G}\}.$$

The idea is to restrict the study to a set $K \subset \mathbb{R}^d$ where $f \ge c_0 > 0$ over $K$. For
this purpose, we can consider a set $\mathcal{G}$ of classifiers $g$ such that $\{x \in \mathcal{X} : f(x) >
0\} \subset K$. We can also introduce the following loss:

$$\ell_{\lambda,K}(g, z, y) = \int_K k_\lambda(z, x) \ell(g(x), y) \nu(dx). \qquad (3.7)$$

It means that we deal with the minimization of a true risk of the form:

$$R_{\ell,K}(g) = \mathbb{E}\, \ell(g(X), Y) \mathbb{1}(X \in K).$$




With (3.7), it is straightforward to get $(L_\mu)$ with $\mu = P$ since if $f \ge c_0 > 0$ on
$K$, one gets:




Roughly speaking, Assumption $(L_\mu)$ in Definition 1 with $\mu = P$ provides a
control of the variance of $\ell_\lambda(g, (Z, Y))$ by the variance of $\ell(g(X), Y)$. To have a
control of the $L_2(\tilde P)$-norm with respect to the $L_2(P)$-norm, we need to restrict
the problem to $\{x : f(x) > 0\}$. Otherwise, the variance of $\ell_\lambda(g, (Z, Y))$ cannot
be compared with the variance of $\ell(g(X), Y)$.

The following corollary establishes the same performance for the $\lambda$-ERM over
$K$ defined as:

$$\hat g_n^{\lambda,K} = \arg\min_{g \in \mathcal{G}} \sum_{i=1}^{n} \ell_{\lambda,K}(g, Z_i, Y_i).$$

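Corollary 1. Suppose $\{\ell(g) - \ell(g^*), g \in \mathcal{G}\}$ is a Bernstein class with respect
to $P$ with parameter $\kappa \ge 1$ and $\ell(g(\cdot), y) \in L_2(\nu)$, for any $y \in \mathcal{Y}$. Suppose
$0 < \rho < 1$ exists such that:

$$\tilde\omega_n(\mathcal{G}, \delta, P) \le C_1 \frac{c(\lambda)}{\sqrt{n}} \delta^{1-\rho}, \quad \forall\, 0 < \delta < 1.$$

1. Under (A1) and (R1), $\hat g_n^{\lambda,K}$ satisfies, for $n$ large enough:

$$\sup \mathbb{E} R_{\ell,K}(\hat g_n^{\lambda,K}) - R_{\ell,K}(g^*) \le C n^{-\frac{\kappa\gamma}{\gamma(2\kappa+\rho-1)+(2\kappa-1)\bar\beta}},$$

where $\bar\beta = \sum_{i=1}^{d} \beta_i$, for a choice of $\lambda = (\lambda_1, \ldots, \lambda_d)$ given by:

$$\forall i \in \{1, \ldots, d\}, \quad \lambda_i = n^{-\frac{2\kappa-1}{2\gamma(2\kappa+\rho-1)+2(2\kappa-1)\bar\beta}}. \qquad (3.8)$$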



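2. Under (A2) and (R2), $\hat g_n^{N,K}$ satisfies, for $n$ large enough:

$$\sup_{f_y \in \mathcal{P}(\gamma, L)} \mathbb{E} R_{\ell,K}(\hat g_n^{N,K}) - R_{\ell,K}(g^*) \le C n^{-\frac{\kappa\gamma}{\gamma(2\kappa+\rho-1)+(2\kappa-1)\beta}},$$

where we choose $N$ such that:

$$N = n^{\frac{2\kappa-1}{2\gamma(2\kappa+\rho-1)+2(2\kappa-1)\beta}}.$$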

This corollary yields the same fast rates of convergence as Theorems 2 and 3
under a weaker Bernstein assumption. The price to pay for the $\lambda$-ERM with
restricted loss (3.7) is that the estimation procedure depends on $K$.

4. Complexity from indirect observations 

The main results of this paper rely on the control of the indirect modulus of
continuity defined as:

$$\tilde\omega_n(\mathcal{G}, \delta, \mu) = \mathbb{E} \sup_{g, g' \in \mathcal{G} : \|\ell(g) - \ell(g')\|_{L_2(\mu)} \le \delta} |\tilde P - \tilde P_n| (\ell_\lambda(g) - \ell_\lambda(g')).$$

In this section, we intend to upper bound this quantity thanks to standard
learning theory arguments. The first result links the control of $\tilde\omega_n(\mathcal{G}, \delta, \mu)$ to the
bracketing entropy of the loss class, which generalizes the result of the direct
case (see van der Vaart and Wellner [1996]) when $A = \mathrm{Id}$.

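Lemma 1. Consider an LB-class $\{\ell_\lambda(g), g \in \mathcal{G}\}$ with respect to $\mu$ with Lipschitz
constant $c(\lambda)$. Then, given some $0 < \rho < 1$, we have:

$$H_B(\{\ell(g), g \in \mathcal{G}\}, \epsilon, L_2(\mu)) \le c\, \epsilon^{-2\rho} \;\Rightarrow\; \tilde\omega_n(\mathcal{G}, \delta, \mu) \le C \frac{c(\lambda)}{\sqrt{n}} \delta^{1-\rho},$$

where $H_B(\{\ell(g), g \in \mathcal{G}\}, \epsilon, L_2(\mu))$ denotes the $\epsilon$-entropy with bracketing of the
set $\{\ell(g), g \in \mathcal{G}\}$ with respect to $L_2(\mu)$ (see van der Vaart and Wellner [1996]
for a definition).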

With such a lemma, it is possible to control the complexity in the indirect
setup thanks to standard entropy conditions. The proof is presented in Section
6. It is based on a maximal inequality due to van der Vaart and Wellner [1996]
applied to the class:

$$\mathcal{F} = \{\ell_\lambda(g) - \ell_\lambda(g'),\; g, g' \in \mathcal{G} : \|\ell(g) - \ell(g')\|_{L_2(\mu)} \le \delta\}.$$

For instance, let us consider a loss $\ell$ such that $t \mapsto \ell(y, t)$ is a convex function,
for any $y \in \mathcal{Y}$. Both least squares and large margin classification can be viewed
as special cases of convex losses where $\ell(y, t) = (y - t)^2$ or $\ell(y, t) = \Phi(yt)$
respectively, with a given convex function $\Phi$ (such as $\Phi(u) = (1 - u)_+$ for the
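hinge loss). In this case, using the convexity of the loss, it is straightforward to
obtain with Lemma 1 the following corollary.

Corollary 2. Suppose $\ell(\cdot, y)$ is convex for any $y \in \mathcal{Y}$, $\{\ell(g) - \ell(g^*), g \in \mathcal{G}\}$ is
a Bernstein class with respect to $\mu$ with parameter $\kappa \ge 1$ and $\ell(g(\cdot), y) \in L_2(\nu)$,
for any $y \in \mathcal{Y}$. Suppose $0 < \rho < 1$ exists such that:

$$H_B(\mathcal{G}, \epsilon, L_2(\mu)) \le c\, \epsilon^{-2\rho}, \quad \forall\, 0 < \epsilon < 1.$$

1. Under (A1) and (R1), the solution $\hat g_n^\lambda$ of the minimization (1.5) satisfies:

$$\sup \mathbb{E} R_\ell(\hat g_n^\lambda) - R_\ell(g^*) \le C n^{-\frac{\kappa\gamma}{\gamma(2\kappa+\rho-1)+(2\kappa-1)\bar\beta}},$$

where $\bar\beta = \sum_{i=1}^{d} \beta_i$ and $\lambda = (\lambda_1, \ldots, \lambda_d)$ is given by:

$$\forall i \in \{1, \ldots, d\}, \quad \lambda_i = n^{-\frac{2\kappa-1}{2\gamma(2\kappa+\rho-1)+2(2\kappa-1)\bar\beta}}. \qquad (4.1)$$

2. Under (A2) and (R2), $\hat g_n^N$ satisfies:

$$\sup_{f_y \in \mathcal{P}(\gamma, L)} \mathbb{E} R_\ell(\hat g_n^N) - R_\ell(g^*) \le C n^{-\frac{\kappa\gamma}{\gamma(2\kappa+\rho-1)+(2\kappa-1)\beta}},$$

where we choose $N$ such that:

$$N = n^{\frac{2\kappa-1}{2\gamma(2\kappa+\rho-1)+2(2\kappa-1)\beta}}.$$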

This corollary is a special case of the more general analysis of the
previous sections. It allows one to consider standard hypothesis sets $\mathcal{G}$ such as VC
classes or kernel classes (Massart and Nédélec [2006] or Mendelson [2003]).

Another possible powerful direction is to study directly the complexity of the
class $\{\ell_\lambda(g), g \in \mathcal{G}\}$ thanks to entropy numbers of compact operators. For this
purpose, note that if $\mathcal{X}$ is compact, $\ell_\lambda(g, z, y) = \int k_\lambda(z, x) \ell(g(x), y) \nu(dx)$ can
be considered as the image of $\ell(g)$ by the integral operator $L_{k_\lambda}$ associated with
the function $k_\lambda$. Hence we have:

$$\{\ell_\lambda(g), g \in \mathcal{G}\} = L_{k_\lambda}(\{\ell(g), g \in \mathcal{G}\}).$$

Furthermore, it is clear that if $k_\lambda$ is continuous, $L_{k_\lambda}$ is well-defined and compact.
Using for instance Williamson et al. [2001], and provided that $\ell$ is bounded and
$\mathcal{G}$ consists of bounded functions in $L_2(\nu, \mathcal{X})$, the entropy of the class $\{\ell_\lambda(g), g \in \mathcal{G}\}$
could be controlled in terms of the eigenvalues of the integral operator. In this
case, it is clear that the entropy of the class $\{\ell_\lambda(g), g \in \mathcal{G}\}$ depends strongly on
the spectrum of the operator $A$.

More precisely, if $A$ is a convolution product, Section 3.1 deals with kernel
deconvolution estimators. As a result, the operator $L_{k_\lambda}$ is defined as the convolution
product $L_{k_\lambda} f(z) = \frac{1}{\lambda} \mathcal{K}_\eta(\cdot/\lambda) * f(z)$. Its spectrum is related to the behavior of
the Fourier transform of the deconvolution kernel estimator, which corresponds
to the quantity $\frac{\mathcal{F}[\mathcal{K}]}{\mathcal{F}[\eta](\cdot/\lambda)}$. In the end, the control of the entropy of the class of
interest $\{\ell_\lambda(g), g \in \mathcal{G}\}$ could be obtained thanks to an assumption over the
behavior of the Fourier transform of the noise distribution $\eta$ such as (A1).

5. Conclusion

This paper has tried to investigate the effect of indirect observations on the
statement of fast rates of convergence in empirical risk minimization. Many
issues could be considered in future works.

The main result is a general upper bound in the statistical learning context,
when we observe indirect inputs $Z_i$, $i = 1, \ldots, n$, with law $Af$. The proof is
based on a deviation inequality for suprema of empirical processes. It seems to
fit the indirect case provided that it is used carefully. For this purpose, we
introduce Lipschitz and bounded classes $\{\ell_\lambda(g), g \in \mathcal{G}\}$, depending on a smoothing
parameter $\lambda$. It allows us to quantify the effect of the inverse problem on the
empirical process machinery. The price to pay is summarized in a constant $c(\lambda)$
which explodes as $\lambda \to 0$. The behavior of this constant is related to the degree
of ill-posedness. Here, in the mildly ill-posed case, $c(\lambda)$ grows polynomially as a
function of $\lambda$.

The result of Section 2 suggests the same degree of generality as the
results of Koltchinskii [2006] in the direct case. It is well-known that the work
of Koltchinskii allows one to recover most of the recent results in statistical
learning theory and the area of fast rates. Consequently, there is good hope that
many problems dealing with indirect observations could be managed following
the guiding thread of this paper.

The estimation procedure proposed in this paper can be discussed for several
reasons. Firstly, it is not adaptive in several senses. At first glance, we can
see three levels of adaptation: (1) adaptation to the operator $A$; (2) adaptation
to the tunable parameter $\lambda$; (3) adaptation or model selection of the
hypothesis space $\mathcal{G}$. At this point, it is important to note that, at least in the direct
case, the same machinery used to analyze the order of the excess risk can be
applied to produce penalized empirical risk minimization (see Blanchard et al.
[2008], Koltchinskii [2006], Loustau [2009], Tsybakov and van de Geer [2005]).
However, the construction of adaptive versions of the $\lambda$-ERM of the previous
sections is a challenging open problem.

Finally, the aim of this contribution was to derive excess risk bounds under
standard assumptions over the complexity and the geometry of the considered
class $\mathcal{G}$. An alternative point of view would be to state oracle-type
inequalities. Indeed, Theorems 1 to 3 could be written in terms of exact asymptotic oracle
inequalities of the form:

$$\mathbb{E} R_\ell(\hat g_n^\lambda) \le \inf_{g \in \mathcal{G}} R_\ell(g) + r_n(\mathcal{G}),$$

where the residual term $r_n(\mathcal{G})$ corresponds to the rates of convergence in
Theorems 1 to 3. In this setting, it is well-known that ERM estimators reach optimal
fast rates under a Bernstein assumption. However, the Bernstein assumption
presented in Definition 2 is a strong assumption related to the geometry of the
class $\mathcal{G}$. Lecué and Mendelson [2012] propose to relax the Bernstein
assumption significantly and point out non-exact oracle inequalities of the form:

$$\mathbb{E} R_\ell(\hat g) \le (1 + \epsilon) \inf_{g \in \mathcal{G}} R_\ell(g) + r_n(\mathcal{G}),$$

for some $\epsilon > 0$. These results hold without the Bernstein condition for any
non-negative loss function. There is good hope that such a study can be done
in the presence of indirect observations, using some minor modifications in the
proofs.

6. Proofs 

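The main ingredient of the proofs is a concentration inequality for empirical
processes in the spirit of Talagrand (Talagrand [1996]). More precisely, we use a
Bennett-type deviation bound for suprema of empirical processes due to Bousquet (see
Bousquet [2002]) applied to a class of measurable functions $f \in \mathcal{F}$ from $\mathcal{X}$ into
$[0, K]$. In this case it is stated in Bousquet [2002] that for all $t > 0$:

$$P\Big( Z \ge \mathbb{E} Z + \sqrt{2t(n\sigma^2 + (1 + K)\mathbb{E} Z)} + \frac{t}{3} \Big) \le \exp(-t),$$

where

$$Z = \sup_{f \in \mathcal{F}} \Big| \sum_{i=1}^{n} \big(f(X_i) - \mathbb{E} f(X_i)\big) \Big| \quad \text{and} \quad \sup_{f \in \mathcal{F}} \mathrm{Var}(f(X_i)) \le \sigma^2.$$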
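To make the roles of $n$, $\sigma^2$, $K$ and $t$ in this deviation bound concrete, the following sketch evaluates the threshold on the right-hand side inside the probability. It is only an illustration under stated assumptions: the function names are hypothetical, the expected supremum $\mathbb{E} Z$ is supplied by the caller, and the last term is taken as $t/3$, as in the classical statement for classes bounded by 1.

```python
import math


def bousquet_threshold(expected_sup, n, sigma2, K, t):
    """Evaluate EZ + sqrt(2 t (n sigma^2 + (1 + K) EZ)) + t / 3.

    The t / 3 term is an assumption borrowed from the classical statement.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    variance_proxy = n * sigma2 + (1.0 + K) * expected_sup
    return expected_sup + math.sqrt(2.0 * t * variance_proxy) + t / 3.0


def tail_probability_bound(t):
    """Upper bound exp(-t) on the probability of exceeding the threshold."""
    return math.exp(-t)


# Made-up inputs: EZ = 1, n = 100, sigma^2 = 0.01, K = 1, t = 2.
threshold = bousquet_threshold(1.0, 100, 0.01, 1.0, 2.0)
probability = tail_probability_bound(2.0)
```

The threshold grows like $\sqrt{t}$ for moderate $t$ and linearly in $t$ for large $t$, which is the Bennett-type behavior used when the inequality is iterated in Lemma 2.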

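The proof of Lemma 2 below applies Bousquet's inequality iteratively and leads
to a fixed point equation as in Koltchinskii [2006]. For this purpose,
we introduce, for a function $\psi : \mathbb{R}_+ \to \mathbb{R}_+$, the following transformations:

$$\psi^\flat(\delta) = \sup_{\sigma \ge \delta} \frac{\psi(\sigma)}{\sigma} \quad \text{and} \quad \psi^\sharp(\epsilon) = \inf\{\delta > 0 : \psi^\flat(\delta) \le \epsilon\}.$$

We are also interested in the following discretized versions of these
transformations:

$$\psi^{\flat,q}(\delta) = \sup_{\delta_j \ge \delta} \frac{\psi(\delta_j)}{\delta_j} \quad \text{and} \quad \psi^{\sharp,q}(\epsilon) = \inf\{\delta > 0 : \psi^{\flat,q}(\delta) \le \epsilon\},$$

where for some $q > 1$, $\delta_j = q^{-j}$ for $j \in \mathbb{N}$.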
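The discretized transformations can be evaluated directly on the grid $\delta_j = q^{-j}$. The sketch below is an illustration only: the function names, the truncation of the grid to finitely many levels starting at $j = 0$, and the two boundary conventions flagged in the comments are assumptions, not part of the definitions above.

```python
def flat_q(psi, delta, q=2.0, levels=60):
    """Discretized flat transform: sup of psi(d_j) / d_j over grid points d_j = q**(-j) >= delta."""
    ratios = [psi(q ** (-j)) / q ** (-j) for j in range(levels + 1) if q ** (-j) >= delta]
    return max(ratios) if ratios else 0.0  # convention (assumption): 0 when no grid point is >= delta


def sharp_q(psi, eps, q=2.0, levels=60):
    """Discretized sharp transform: inf of the delta > 0 with flat_q(psi, delta) <= eps."""
    running_sup = 0.0
    infimum = 1.0  # convention (assumption): d_0 = 1 when even level j = 0 violates the bound
    for j in range(levels + 1):
        d = q ** (-j)
        running_sup = max(running_sup, psi(d) / d)  # value of flat_q on (q**-(j+1), q**-j]
        if running_sup > eps:
            break
        infimum = q ** (-(j + 1))
    return infimum


def psi(s):
    """Example function psi(s) = sqrt(s), so that psi(s) / s decreases in s."""
    return s ** 0.5


flat_value = flat_q(psi, 0.1)     # sup over the grid points 1, 1/2, 1/4, 1/8
sharp_value = sharp_q(psi, 5.0)
```

For this $\psi$, the continuous transform is $\psi^\sharp(\epsilon) = \epsilon^{-2}$, and the discretized value stays within a factor $q$ below it.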

In the sequel, $K, C > 0$ denote generic constants that may vary
from line to line.

6.1. Proof of Theorem 1 

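Lemma 2. Suppose $\{\ell_\lambda(g), g \in \mathcal{G}\}$ is such that:

$$\sup_{g \in \mathcal{G}} \|\ell_\lambda(g)\|_\infty \le K(\lambda).$$

Suppose $\{\ell_\lambda(g), g \in \mathcal{G}\}$ has approximation function $a(\lambda)$ and residual constant
$0 < r < 1$ according to Definition 3. Define, for some constant $K > 0$:

$$U_n^\lambda(\delta_j, t) = K \Big[ \phi_n^\lambda(\mathcal{G}, \delta_j) + \sqrt{\frac{t}{n} \phi_n^\lambda(\mathcal{G}, \delta_j)(1 + K(\lambda))} + \sqrt{\frac{t}{n}} D^\lambda(\mathcal{G}, \delta_j) + \frac{t}{n} \Big],$$

$$\phi_n^\lambda(\mathcal{G}, \delta_j) = \mathbb{E} \sup_{g, g' \in \mathcal{G}(\delta_j)} |\tilde P_n - \tilde P| [\ell_\lambda(g) - \ell_\lambda(g')],$$

$$D^\lambda(\mathcal{G}, \delta_j) = \sup_{g, g' \in \mathcal{G}(\delta_j)} \sqrt{\tilde P (\ell_\lambda(g) - \ell_\lambda(g'))^2}.$$

Then $\forall \delta \ge \bar\delta_n^\lambda(t) = [U_n^\lambda(\cdot, t)]^\sharp \big(\frac{1-r}{2q}\big)$, if $a(\lambda) \le \frac{1-r}{2q} \delta$, we have for $\hat g = \hat g_n^\lambda$:

$$P\big(R_\ell(\hat g) - R_\ell(g^*) > \delta\big) \le \log_q\Big(\frac{1}{\delta}\Big) e^{-t}.$$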



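Proof. The proof follows Koltchinskii [2006], extended to the noisy set-up.
Given $q > 1$, we introduce a sequence of positive numbers:

$$\delta_j = q^{-j}, \quad j \ge 1.$$

Given $n, j \ge 1$, $t > 0$ and $\lambda \in \mathbb{R}_+^d$, consider the event:

$$E_{n,j}^\lambda(t) = \Big\{ \sup_{g, g' \in \mathcal{G}(\delta_j)} |\tilde P_n - \tilde P| [\ell_\lambda(g) - \ell_\lambda(g')] \le U_n^\lambda(\delta_j, t) \Big\}.$$

Then, we have, using Bousquet's version of Talagrand's concentration inequality
(see Bousquet [2002]), for some $K > 0$, $P(E_{n,j}^\lambda(t)^c) \le e^{-t}$, $\forall t > 0$.

We restrict ourselves to the event $E_{n,j}^\lambda(t)$.

Using Definition 3, we have with a slight abuse of notation:

$$\begin{aligned}
R_\ell(\hat g) - R_\ell(g^*) &\le (\tilde P_n - \tilde P)(\ell_\lambda(g^*) - \ell_\lambda(\hat g)) + (R_\ell - R_\ell^\lambda)(\hat g - g^*) \\
&\le (\tilde P_n - \tilde P)(\ell_\lambda(g^*) - \ell_\lambda(\hat g)) + a(\lambda) + r(R_\ell(\hat g) - R_\ell(g^*)).
\end{aligned}$$

Hence, we have:

$$\delta_{j+1} \le R_\ell(\hat g) - R_\ell(g^*) \le \delta_j \;\Rightarrow\; \delta_{j+1} \le \frac{1}{1-r} \big( (\tilde P_n - \tilde P)(\ell_\lambda(g^*) - \ell_\lambda(\hat g)) + a(\lambda) \big).$$

On the event $E_{n,j}^\lambda(t)$, it follows that $\forall \delta \le \delta_j$:

$$\begin{aligned}
\delta_{j+1} \le R_\ell(\hat g) - R_\ell(g^*) \le \delta_j \;\Rightarrow\; \delta_{j+1} &\le \frac{1}{1-r} U_n^\lambda(\delta_j, t) + \frac{1}{1-r} a(\lambda) \\
&\le \frac{\delta_j}{1-r} V_n^\lambda(\delta, t) + \frac{1}{1-r} a(\lambda),
\end{aligned}$$

since we have, with $V_n^\lambda(\cdot, t) = [U_n^\lambda(\cdot, t)]^\flat$:

$$U_n^\lambda(\delta_j, t) \le \delta_j V_n^\lambda(\delta, t), \quad \forall \delta \le \delta_j.$$

We obtain:

$$\frac{1}{1-r} V_n^\lambda(\delta, t) \ge \frac{1}{q} - \frac{1}{(1-r)\delta_j} a(\lambda) \ge \frac{1}{2q},$$

since we have:

$$\frac{1}{1-r} a(\lambda) \le \frac{1}{2q} \delta.$$

It follows from the definition of the $\sharp$-transform that:

$$\delta \le [U_n^\lambda(\cdot, t)]^\sharp \Big(\frac{1-r}{2q}\Big) = \bar\delta_n^\lambda(t).$$

Hence, we have on the event $E_{n,j}^\lambda(t)$, for $\delta_j \ge \delta$:

$$\hat g \in \mathcal{G}(\delta_{j+1}, \delta_j) \;\Rightarrow\; \delta \le \bar\delta_n^\lambda(t),$$

or equivalently,

$$\delta > \bar\delta_n^\lambda(t) \;\Rightarrow\; \hat g \notin \mathcal{G}(\delta_{j+1}, \delta_j),$$

where $\mathcal{G}(c, C) = \{g \in \mathcal{G} : c \le R_\ell(g) - R_\ell(g^*) \le C\}$. We eventually obtain:

$$\bigcap_{\delta_j \ge \delta} E_{n,j}^\lambda(t) \text{ and } \delta > \bar\delta_n^\lambda(t) \;\Rightarrow\; R_\ell(\hat g) - R_\ell(g^*) \le \delta.$$

This formulation allows us to write by the union bound:

$$P\big(R_\ell(\hat g) - R_\ell(g^*) > \delta\big) \le \sum_{\delta_j \ge \delta} P\big(E_{n,j}^\lambda(t)^c\big) \le \log_q\Big(\frac{1}{\delta}\Big) e^{-t},$$

since $\{j : \delta_j \ge \delta\} = \{j : j \le \log_q(1/\delta)\}$.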



□ 



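Proof of Theorem 1. The proof is a direct application of Lemma 2. We have,
for some constant $K > 0$:

$$U_n^\lambda(\delta, t) = K \Big[ \phi_n^\lambda(\mathcal{G}, \delta) + \sqrt{\frac{t}{n} \phi_n^\lambda(\mathcal{G}, \delta)(1 + K(\lambda))} + \sqrt{\frac{t}{n}} D^\lambda(\mathcal{G}, \delta) + \frac{t}{n} \Big].$$

Using the Bernstein condition together with the complexity assumption over
$\tilde\omega_n(\mathcal{G}, \delta, \mu)$, we have:

$$\begin{aligned}
\phi_n^\lambda(\mathcal{G}, \delta) &= \mathbb{E} \sup_{g, g' \in \mathcal{G}(\delta)} |\tilde P_n - \tilde P| [\ell_\lambda(g) - \ell_\lambda(g')] \\
&\le \mathbb{E} \sup_{g, g' \in \mathcal{G} : \|\ell(g) - \ell(g')\|_{L_2(\mu)} \le 2\sqrt{\kappa_0}\, \delta^{1/(2\kappa)}} |\tilde P_n - \tilde P| [\ell_\lambda(g) - \ell_\lambda(g')] = \tilde\omega_n\big(\mathcal{G}, 2\sqrt{\kappa_0}\, \delta^{\frac{1}{2\kappa}}, \mu\big) \\
&\le C \frac{c(\lambda)}{\sqrt{n}} \delta^{\frac{1-\rho}{2\kappa}}.
\end{aligned}$$

A control of $D^\lambda(\mathcal{G}, \delta)$ using the Lipschitz assumption leads to:

$$U_n^\lambda(\delta, t) \le K \Big[ \frac{c(\lambda)}{\sqrt{n}} \delta^{\frac{1-\rho}{2\kappa}} + \sqrt{\frac{t}{n} (1 + K(\lambda)) \frac{c(\lambda)}{\sqrt{n}}}\; \delta^{\frac{1-\rho}{4\kappa}} + \sqrt{\frac{t}{n}}\, c(\lambda)\, \delta^{\frac{1}{2\kappa}} + \frac{t}{n} \Big].$$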

Hence we have from an easy calculation: 



S^M < Cmax m) ^ , MWAlF±i,^^ W 
Consequently, for any < t < 1, for n large enough, we have: 



2k t 



provided that:



KiX) < 



>S^it + \og\ogg n), 



C(A) 2» + p-l 7J2K + P-1 



1 + log log, n 

It remains to use Lemma 2 with $t$ replaced by $t + \log\log_q n$ to obtain:



c(A) 



< e" 



provided that the approximation function satisfies the following inequality:



4q V 



□ 



6.2. Proof of Theorem 2 

Theorem 2 is a straightforward application of Theorem 1 to the particular case 
of errors in variables using deconvolution kernel estimators. 

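The first step is to check that the estimation procedure described in Section 3.1
gives rise to an LB-class with respect to $\nu_Y$, where $\nu$ is the Lebesgue measure on
$\mathbb{R}^d$.

Lemma 3. Suppose (A1) holds and suppose $\ell(g(\cdot), y) \in L_2(\mathcal{X})$ for any $y \in \mathcal{Y}$.
Consider a deconvolution kernel

$$\mathcal{K}_\eta(t) = \mathcal{F}^{-1} \Big[ \frac{\mathcal{F}[\mathcal{K}](\cdot)}{\mathcal{F}[\eta](\cdot/\lambda)} \Big](t),$$

where $\mathcal{K}(t) = \prod_{i=1}^{d} \mathcal{K}_i(t_i)$ and the $\mathcal{K}_i$ have compactly supported and bounded Fourier transform.
Then we have:

$$\|\ell_\lambda(g) - \ell_\lambda(g')\|_{L_2(\tilde P)} \le C \prod_{i=1}^{d} \lambda_i^{-\beta_i} \|\ell(g) - \ell(g')\|_{L_2(\nu_Y)},$$

and moreover:

$$\sup_{g \in \mathcal{G}} \|\ell_\lambda(g)\|_\infty \le C \prod_{i=1}^{d} \lambda_i^{-\beta_i - 1/2}.$$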

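Proof. We have in dimension $d = 1$ for simplicity, using the boundedness
assumptions:

$$\begin{aligned}
\|\ell_\lambda(g) - \ell_\lambda(g')\|^2_{L_2(\tilde P)} &= \sum_{y \in \mathcal{Y}} p_y \int \Big[ \int_{\mathcal{X}} \frac{1}{\lambda} \mathcal{K}_\eta\Big(\frac{z - x}{\lambda}\Big) \big(\ell(g(x), y) - \ell(g'(x), y)\big) dx \Big]^2 Af_y(z) dz \\
&= \sum_{y \in \mathcal{Y}} p_y \int \Big| \frac{1}{\lambda} \mathcal{K}_\eta\Big(\frac{\cdot}{\lambda}\Big) * \big(\ell(g(\cdot), y) - \ell(g'(\cdot), y)\big)(z) \Big|^2 Af_y(z) dz \\
&\le C \sum_{y \in \mathcal{Y}} p_y \int \Big| \mathcal{F}\Big[\frac{1}{\lambda} \mathcal{K}_\eta\Big(\frac{\cdot}{\lambda}\Big)\Big](t) \Big|^2 \big| \mathcal{F}[\ell(g(\cdot), y) - \ell(g'(\cdot), y)](t) \big|^2 dt \\
&\le C \lambda^{-2\beta} \|\ell(g) - \ell(g')\|^2_{L_2(\nu_Y)},
\end{aligned}$$

where we use in the last line the following inequalities:

$$\frac{1}{\lambda^2} \big| \mathcal{F}[\mathcal{K}_\eta(\cdot/\lambda)](s) \big|^2 = \big| \mathcal{F}[\mathcal{K}_\eta](s\lambda) \big|^2 \le \sup_{t \in \mathbb{R}} \Big| \frac{\mathcal{F}[\mathcal{K}](t)}{\mathcal{F}[\eta](t/\lambda)} \Big|^2 \le C \lambda^{-2\beta},$$

provided that $\mathcal{F}[\mathcal{K}]$ is compactly supported.

In the same way, the second assertion holds since, if $\ell(g(\cdot), y) \in L_2(\mathcal{X})$:

$$\sup_{(z,y)} |\ell_\lambda(g, (z, y))| \le \sup_{(z,y)} \int_{\mathcal{X}} \Big| \frac{1}{\lambda} \mathcal{K}_\eta\Big(\frac{z - x}{\lambda}\Big) \ell(g(x), y) \Big| dx \le C \sup_{z} \sqrt{\int_{\mathcal{X}} \Big| \frac{1}{\lambda} \mathcal{K}_\eta\Big(\frac{z - x}{\lambda}\Big) \Big|^2 dx} \le C \lambda^{-\beta - 1/2}.$$

A straightforward generalization leads to the d-dimensional case.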



□ 



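The last step is to get an approximation function for the class $\{\ell_\lambda(g), g \in \mathcal{G}\}$
with the following lemma:

Lemma 4. Suppose (R1) holds and $\mathcal{K}_\eta(t) = \mathcal{F}^{-1} \big[ \frac{\mathcal{F}[\mathcal{K}](\cdot)}{\mathcal{F}[\eta](\cdot/\lambda)} \big](t)$ is such that $\mathcal{K}$
is a kernel of order $\gamma$ with respect to the Lebesgue measure. Then if $\{\ell(g) -
\ell(g'), g, g' \in \mathcal{G}\}$ is Bernstein with parameter $\kappa \ge 1$, we have:

$$\forall g, g' \in \mathcal{G}, \quad (R_\ell^\lambda - R_\ell)(g - g') \le a(\lambda) + r(R_\ell(g) - R_\ell(g')),$$

where

$$a(\lambda) = C \sum_{i=1}^{d} \lambda_i^{\frac{2\kappa\gamma}{2\kappa-1}} \quad \text{and} \quad r = \frac{1}{2\kappa}.$$

Moreover, if $|\ell(g(x), y) - \ell(g'(x), y)| = |\ell(g(x), y) - \ell(g'(x), y)|^2$ and $\kappa > 1$, we
have:

$$a(\lambda) = C \sum_{i=1}^{d} \lambda_i^{\frac{\kappa\gamma}{\kappa-1}} \quad \text{and} \quad r = \frac{1}{\kappa}.$$

Proof. We consider the case $d = 1$ for simplicity. Using the elementary property
$\mathbb{E} \frac{1}{\lambda} \mathcal{K}_\eta\big(\frac{Z - x}{\lambda}\big) = \mathbb{E} \frac{1}{\lambda} \mathcal{K}\big(\frac{X - x}{\lambda}\big)$, together with Fubini, we can write:

$$(R_\ell^\lambda - R_\ell)(g - g') = \sum_{y \in \mathcal{Y}} p_y \int_{\mathcal{X}} \int \mathcal{K}(u) \big(\ell(g(x), y) - \ell(g'(x), y)\big) \big(f_y(x + \lambda u) - f_y(x)\big) du\, dx.$$

Now since the $f_y$'s have $l = \lfloor\gamma\rfloor$ derivatives, there exists $\tau \in ]0, 1[$ such that:

$$\begin{aligned}
\int \mathcal{K}(u) \big(f_y(x + \lambda u) - f_y(x)\big) du &= \int \mathcal{K}(u) \Big( \sum_{k=1}^{l-1} \frac{f_y^{(k)}(x)}{k!} (\lambda u)^k + \frac{f_y^{(l)}(x + \tau\lambda u)}{l!} (\lambda u)^l \Big) du \\
&= \int \mathcal{K}(u) \frac{(\lambda u)^l}{l!} \big( f_y^{(l)}(x + \tau\lambda u) - f_y^{(l)}(x) \big) du \\
&\le C \int |\mathcal{K}(u)| \frac{|\lambda u|^\gamma}{l!} du \le C \lambda^\gamma,
\end{aligned}$$

where we use in the last line the Hölder regularity of the $f_y$'s and that $\mathcal{K}$ is a kernel
of order $l = \lfloor\gamma\rfloor$.

Using the Bernstein assumption, one gets:

$$\begin{aligned}
(R_\ell^\lambda - R_\ell)(g - g') &\le C \lambda^\gamma \sum_{y \in \mathcal{Y}} p_y \int_{\mathcal{X}} |\ell(g(x), y) - \ell(g'(x), y)| dx \\
&\le C \lambda^\gamma \sqrt{\sum_{y \in \mathcal{Y}} p_y \int_{\mathcal{X}} |\ell(g(x), y) - \ell(g'(x), y)|^2 dx} \\
&\le C \|\ell(g) - \ell(g')\|_{L_2(\nu_Y)} \lambda^\gamma \\
&\le C \lambda^\gamma (R_\ell(g) - R_\ell(g'))^{\frac{1}{2\kappa}} \\
&\le C \lambda^{\frac{2\kappa\gamma}{2\kappa-1}} + \frac{1}{2\kappa} (R_\ell(g) - R_\ell(g')),
\end{aligned}$$

where we use in the last line Young's inequality:

$$x y^r \le r y + x^{\frac{1}{1-r}}, \quad \forall r < 1,$$

with $r = \frac{1}{2\kappa}$.

For the second statement, if $|\ell(g(x), y) - \ell(g'(x), y)| = |\ell(g(x), y) - \ell(g'(x), y)|^2$
and $\kappa > 1$, it is straightforward that $2\kappa$ can be replaced by $\kappa$ to get the result. □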
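The form of Young's inequality used in the last line, $x y^r \le r y + x^{1/(1-r)}$ for $x, y \ge 0$ and $0 < r < 1$, follows from the classical inequality with conjugate exponents $1/(1-r)$ and $1/r$. The sketch below is a numerical sanity check on a grid, not a proof; the function name, the grid, and the value $\kappa = 2$ are arbitrary choices made for illustration.

```python
def young_gap(x, y, r):
    """Right-hand side minus left-hand side of x * y**r <= r * y + x**(1 / (1 - r))."""
    return r * y + x ** (1.0 / (1.0 - r)) - x * y ** r


# Arbitrary illustration: kappa = 2, hence r = 1 / (2 * kappa) = 0.25.
kappa = 2.0
r = 1.0 / (2.0 * kappa)
grid = [0.01 * i for i in range(1, 301)]  # x and y range over (0, 3]
min_gap = min(young_gap(x, y, r) for x in grid for y in grid)

# Same check with r = 1 / 2, i.e. kappa = 1.
min_gap_half = min(young_gap(x, y, 0.5) for x in grid for y in grid)
```

The gap stays positive on the whole grid, consistent with the sharper classical bound $x y^r \le r y + (1-r) x^{1/(1-r)}$.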

Proof of Theorem 2. The proof is a straightforward application of Theorem 1. 
From Lemma 3 and Lemma 4, condition (2.6) in Theorem 1 can be written: 



Applying Theorem 1 with a smoothing parameter $\lambda$ such that equalities hold
above gives the rates of convergence. □

6.3. Proof of Theorem 3

The first step is to check that the estimation procedure described in Section 3.2
gives rise to an LB-class with respect to $\nu_Y$ with the following lemma.

Lemma 5. Suppose (A2) holds and $\ell(g(\cdot), y) \in L_2(\nu)$ for any $y \in \mathcal{Y}$. Then we
have:

$$\|\ell_\lambda(g) - \ell_\lambda(g')\|_{L_2(\tilde P)} \le C N^{\beta} \|\ell(g) - \ell(g')\|_{L_2(\nu_Y)},$$

and moreover:

$$\sup_{g \in \mathcal{G}} \|\ell_\lambda(g)\|_\infty \le C N^{\beta + 1/2}.$$

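Proof. The proof follows the proof of Lemma 3. We have in dimension $d = 1$
for simplicity, since $(\varphi_k)_{k \in \mathbb{N}^*}$ is an orthonormal basis and using the boundedness
assumptions over the $f_y$'s:

$$\begin{aligned}
\|\ell_\lambda(g) - \ell_\lambda(g')\|^2_{L_2(\tilde P)} &= \sum_{y \in \mathcal{Y}} p_y \int \Big( \sum_{k=1}^{N} b_k^{-1} \psi_k(z) \int_{\mathcal{X}} \varphi_k(x) \big(\ell(g(x), y) - \ell(g'(x), y)\big) \nu(dx) \Big)^2 Af_y(z) \nu(dz) \\
&\le C \sum_{y \in \mathcal{Y}} p_y \int \Big( \sum_{k=1}^{N} b_k^{-1} \psi_k(z) \int_{\mathcal{X}} \big(\ell(g(x), y) - \ell(g'(x), y)\big) \varphi_k(x) \nu(dx) \Big)^2 \nu(dz) \\
&\le C N^{2\beta} \sum_{y \in \mathcal{Y}} p_y \sum_{k=1}^{N} \Big( \int_{\mathcal{X}} \big(\ell(g(x), y) - \ell(g'(x), y)\big) \varphi_k(x) \nu(dx) \Big)^2 \\
&\le C N^{2\beta} \|\ell(g) - \ell(g')\|^2_{L_2(\nu_Y)}.
\end{aligned}$$

In the same way, the second assertion holds since, if $\ell(g) \in L_2(\nu)$:

$$\begin{aligned}
\sup_{(z,y)} |\ell_\lambda(g, (z, y))| &\le \sup_{(z,y)} \Big| \sum_{k=1}^{N} b_k^{-1} \int_{\mathcal{X}} \varphi_k(x) \psi_k(z) \ell(g, (x, y)) \nu(dx) \Big| \\
&\le \sup_{(z,y)} \sqrt{\sum_{k=1}^{N} b_k^{-2}} \sqrt{\sum_{k=1}^{N} \Big( \int_{\mathcal{X}} \varphi_k(x) \ell(g, (x, y)) \nu(dx) \Big)^2 \psi_k(z)^2} \le C N^{\beta + 1/2}.
\end{aligned}$$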



□ 

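The last step is to control the bias term of the procedure with the following
lemma:

Lemma 6. Suppose (R2) holds and $\{\ell(g) - \ell(g'), g, g' \in \mathcal{G}\}$ is Bernstein with
parameter $\kappa \ge 1$. Then we have:

$$\forall g, g' \in \mathcal{G}, \quad (R_\ell^\lambda - R_\ell)(g - g') \le a(N) + r(R_\ell(g) - R_\ell(g')),$$

where, with $\lambda = N$:

$$a(N) = C N^{-\frac{2\kappa\gamma}{2\kappa-1}} \quad \text{and} \quad r = \frac{1}{2\kappa}.$$

Moreover, if $|\ell(g(x), y) - \ell(g'(x), y)| = |\ell(g(x), y) - \ell(g'(x), y)|^2$ and $\kappa > 1$, we
have:

$$a(N) = C N^{-\frac{\kappa\gamma}{\kappa-1}} \quad \text{and} \quad r = \frac{1}{\kappa}.$$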

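Proof. We first write, with $\mathbb{E}_{Z^y} \hat\theta_k^y = \theta_k^y = \int f_y(x) \varphi_k(x) \nu(dx)$:

$$\begin{aligned}
R_\ell^\lambda(g) = \mathbb{E} R_n^\lambda(g) &= \mathbb{E} \sum_{y \in \mathcal{Y}} p_y \int_{\mathcal{X}} \ell(g(x), y) \sum_{k=1}^{N} \hat\theta_k^y \varphi_k(x) \nu(dx) \\
&= \sum_{y \in \mathcal{Y}} p_y \int_{\mathcal{X}} \ell(g(x), y) \sum_{k=1}^{N} \mathbb{E}_{Z^y} \hat\theta_k^y \varphi_k(x) \nu(dx) \\
&= \sum_{y \in \mathcal{Y}} p_y \int_{\mathcal{X}} \ell(g(x), y) \sum_{k=1}^{N} \theta_k^y \varphi_k(x) \nu(dx).
\end{aligned}$$

Hence we have:

$$\begin{aligned}
(R_\ell^\lambda - R_\ell)(g - g') &= \sum_{y \in \mathcal{Y}} p_y \int_{\mathcal{X}} \big(\ell(g(x), y) - \ell(g'(x), y)\big) \Big( \sum_{k=1}^{N} \theta_k^y \varphi_k(x) - \sum_{k \ge 1} \theta_k^y \varphi_k(x) \Big) \nu(dx) \\
&= \sum_{y \in \mathcal{Y}} p_y \int_{\mathcal{X}} \big(\ell(g'(x), y) - \ell(g(x), y)\big) \sum_{k > N} \theta_k^y \varphi_k(x) \nu(dx).
\end{aligned}$$

Using Cauchy-Schwarz twice, we have, since $(\varphi_k)_{k \in \mathbb{N}^*}$ is an orthonormal basis
and provided that $f_y \in \Theta(\gamma, L)$:

$$\begin{aligned}
\Big| \sum_{y \in \mathcal{Y}} p_y \sum_{k > N} \theta_k^y \int_{\mathcal{X}} \big(\ell(g'(x), y) - \ell(g(x), y)\big) \varphi_k(x) \nu(dx) \Big| &\le \sqrt{\sum_{y \in \mathcal{Y}} p_y \sum_{k > N} \Big( \int_{\mathcal{X}} \big(\ell(g(x), y) - \ell(g'(x), y)\big) \varphi_k(x) \nu(dx) \Big)^2} \sqrt{\sum_{y \in \mathcal{Y}} p_y \sum_{k > N} (\theta_k^y)^2} \\
&\le C (R_\ell(g) - R_\ell(g'))^{\frac{1}{2\kappa}} \sqrt{\sum_{y \in \mathcal{Y}} p_y \sum_{k > N} (\theta_k^y)^2 k^{2\gamma} k^{-2\gamma}} \\
&\le C (R_\ell(g) - R_\ell(g'))^{\frac{1}{2\kappa}} N^{-\gamma}.
\end{aligned}$$

We conclude the proof using Young's inequality exactly as in Lemma 4. □

Proof of Theorem 3. The proof is a straightforward application of Theorem 1.
From Lemma 5 and Lemma 6, condition (2.6) in Theorem 1 can be written:

$$N^{-\frac{2\kappa\gamma}{2\kappa-1}} \le C \Big( \frac{N^\beta}{\sqrt{n}} \Big)^{\frac{2\kappa}{2\kappa+\rho-1}} \;\Leftrightarrow\; N \ge C n^{\frac{2\kappa-1}{2\gamma(2\kappa+\rho-1)+2(2\kappa-1)\beta}}.$$

Applying Theorem 1 with a smoothing parameter $N$ such that an equality holds
above gives the rates of convergence. □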
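The bias bound of Lemma 6 rests on the tail estimate $\sum_{k > N} \theta_k^2 \le L N^{-2\gamma}$, which holds for any sequence in the ellipsoid $\Theta(\gamma, L)$ because $k^{-2\gamma} \le N^{-2\gamma}$ for $k > N$. The sketch below checks this numerically on one example; the sequence $\theta_k = k^{-(\gamma+1)}$, the truncation at 5000 coefficients, and the function names are assumptions made for illustration.

```python
def ellipsoid_radius(theta, gamma):
    """Sum of theta_k**2 * k**(2 * gamma), with k starting at 1."""
    return sum(t * t * k ** (2 * gamma) for k, t in enumerate(theta, start=1))


def tail_energy(theta, N):
    """Sum of theta_k**2 over k > N (theta[N] is the coefficient of index N + 1)."""
    return sum(t * t for t in theta[N:])


gamma = 1.0
# Hypothetical coefficients theta_k = k**-(gamma + 1), truncated at k = 5000.
theta = [k ** (-(gamma + 1.0)) for k in range(1, 5001)]
L = ellipsoid_radius(theta, gamma)  # smallest admissible radius for this sequence
checks = [tail_energy(theta, N) <= L * N ** (-2 * gamma) for N in (1, 5, 50, 500)]
```

For this sequence the tail decays like $N^{-3}$, faster than the worst-case bound $L N^{-2\gamma}$ that holds uniformly over the ellipsoid.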

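6.4. Proof of Lemma 1

The proof applies the maximal inequality presented in van der Vaart and Wellner
[1996] to the class:

$$\mathcal{F} = \{\ell_\lambda(g) - \ell_\lambda(g'),\; g, g' \in \mathcal{G} : P(\ell(g) - \ell(g'))^2 \le \delta^2\}.$$

Indeed from Theorem 2.14.2 of van der Vaart and Wellner [1996], we can write,
$\forall \eta > 0$:

$$\begin{aligned}
\tilde\omega_n(\mathcal{G}, \delta, \mu) &= \mathbb{E} \sup_{g, g' \in \mathcal{G} : \|\ell(g) - \ell(g')\|^2_{L_2(\mu)} \le \delta^2} (\tilde P_n - \tilde P)(\ell_\lambda(g) - \ell_\lambda(g')) \\
&\le C \frac{\|F\|_{L_2(\tilde P)}}{\sqrt{n}} \int_0^\eta \sqrt{1 + H_B(\mathcal{F}, \epsilon \|F\|_{L_2(\tilde P)}, L_2(\tilde P))}\, d\epsilon \\
&\quad + C \frac{\sup_{f \in \mathcal{F}} \|f\|_{L_2(\tilde P)}}{\sqrt{n}} \sqrt{1 + H_B(\mathcal{F}, \eta \|F\|_{L_2(\tilde P)}, L_2(\tilde P))}, \qquad (6.1)
\end{aligned}$$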



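where $F(z, y) = \sup_{f \in \mathcal{F}} |\ell_\lambda(g, z, y) - \ell_\lambda(g', z, y)|$ is the envelope function of the
class $\mathcal{F}$. Since $\{\ell_\lambda(g), g \in \mathcal{G}\}$ is an LB-class with bounded constant $K(\lambda)$:

$$\begin{aligned}
\|F\|^2_{L_2(\tilde P)} &= \int F^2(z, y) \tilde P(dz, dy) \\
&= \sum_{y \in \mathcal{Y}} p_y \int \sup_{f \in \mathcal{F}} |\ell_\lambda(g, z, y) - \ell_\lambda(g', z, y)|^2 Af_y(z) \nu(dz) \le 4 K(\lambda)^2.
\end{aligned}$$

Moreover, we have, since $\{\ell_\lambda(g), g \in \mathcal{G}\}$ is an LB-class with respect to $\mu$ with
Lipschitz constant $c(\lambda)$:

$$H_B(\{\ell(g), g \in \mathcal{G}\}, \epsilon, L_2(\mu)) \le c\, \epsilon^{-2\rho} \;\Rightarrow\; H_B(\mathcal{F}, \epsilon, L_2(\tilde P)) \le C c(\lambda)^{2\rho} \epsilon^{-2\rho}.$$

Hence, we have in (6.1), choosing $\eta$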





n Jo 



r v/l + (-^PK{X)-^Pc{X)^Pde + ^l + T]-^PK{X)-^Pc{X) 



2p 



< 



?]K{X)^ ri^-PK{Xf(^-P^c{X)P c{X)5 ciXf+Pr^-PRiXy^PS 



< 




provided that $\delta < 1$.